Problem:
LLM inference / AI pipelines are GPU‑heavy, bursty, and expensive to keep warm.
Kedify solution:
GPU‑aware autoscaling driven by OTel‑based signals scales on real usage (RPS, concurrency, custom metrics), then scales down, including to zero when appropriate.
How it works (example signals):
- HTTP/OTel for request rate, concurrency, or token throughput
- PRP (vertical right‑sizing) to shrink warm pods when idle (an alternative to scaling to zero replicas)
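The signals above map naturally onto a KEDA-style ScaledObject. A minimal sketch of what that might look like, assuming Kedify's `kedify-http` HTTP scaler; the workload name, host, port, and target values are illustrative, and field names should be checked against the current Kedify documentation:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference            # hypothetical workload name
spec:
  scaleTargetRef:
    name: llm-inference          # Deployment serving the model
  minReplicaCount: 0             # scale to zero when traffic stops
  maxReplicaCount: 8             # cap GPU replicas under burst
  triggers:
    - type: kedify-http          # Kedify's HTTP-traffic scaler
      metadata:
        hosts: llm.example.com   # illustrative hostname
        service: llm-inference
        port: "8080"
        scalingMetric: requestRate   # could also target concurrency
        targetValue: "20"            # example target per replica
```

With `minReplicaCount: 0`, idle periods cost nothing; the scaler adds replicas as request rate (or concurrency) crosses the target.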
Outcome:
20–40% GPU cost savings and lower latency variance under bursty load. (From Kedify’s Product Overview and OTel LLM guide.)