Scale on the right signals. Use HTTP RPS/concurrency, backlog age, and tail latency, not just CPU.
Push beats scrape: wire OpenTelemetry into autoscaling to cut the “lag chain.”
Master all types of autoscaling on Kubernetes to safely scale out workloads and schedule jobs.
Preview predictive autoscaling and why we should use it.
GPU‑aware scaling: blend inflight intent with VRAM/SM headroom; hide cold starts.
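
As a rough illustration of the first takeaway, the sketch below applies the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × observed / target), with the largest proposal winning) to several signals at once. The `Signal` class, the sample numbers, and the replica bounds are hypothetical, and treating backlog age and tail latency as proportional to load is a simplifying assumption rather than a recommendation.

```python
import math
from dataclasses import dataclass


@dataclass
class Signal:
    """One scaling signal, expressed as a per-pod average (hypothetical type)."""
    name: str
    current: float  # observed value
    target: float   # value we want each pod to sit at


def desired_replicas(current_replicas, signals, tolerance=0.1,
                     min_replicas=1, max_replicas=50):
    """Return the replica count proposed by the most demanding signal."""
    proposals = []
    for s in signals:
        ratio = s.current / s.target
        if abs(ratio - 1.0) <= tolerance:
            # Close enough to target: this signal proposes no change.
            proposals.append(current_replicas)
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    # The largest proposal wins, then is clamped to the allowed range.
    return max(min_replicas, min(max_replicas, max(proposals)))


# Illustrative values only: CPU looks idle while request rate and backlog do not.
signals = [
    Signal("cpu_utilization", current=0.35, target=0.60),
    Signal("http_rps_per_pod", current=180, target=100),
    Signal("backlog_age_seconds", current=45, target=30),
    Signal("p99_latency_ms", current=420, target=400),
]

print(desired_replicas(4, signals))      # all signals
print(desired_replicas(4, signals[:1]))  # CPU alone
```

With these sample numbers, CPU alone would propose scaling in while the request-rate signal proposes scaling out, which is the gap the first takeaway points at.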