The 30-second definition
Autoscaling: automatic right‑sizing of compute power to match real‑time demand.
Horizontal scaling: adds or removes extra copies of an app (pods, services).
Vertical scaling: gives an app more or less horsepower (CPU, memory) without changing the copy count.
Why you should care

Without autoscaling:
- 30–40% of nodes idle
- Cold‑starts & missed SLAs
- Manual HPA tuning

With autoscaling:
- Up to 40% lower spend
- 99.99% latency compliance
- Zero config drift
How Autoscaling works
1. Metrics Source: traffic, queue depth, GPU utilisation, or business events.
2. Decision Engine: rules or ML decide how many replicas or how much CPU/RAM you need.
3. Orchestrator Action: Kubernetes (or another platform) spins pods up or down.
4. Feedback Loop: keeps checking so you never over‑ or under‑shoot.
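The decision step above can be sketched with the proportional rule the Kubernetes HPA documents (desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)). This is a minimal illustration of that formula, not the full controller logic (no tolerance band, stabilization, or min/max clamping):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """HPA-style proportional scaling decision.

    Scales the replica count by the ratio of observed metric to target,
    rounding up so capacity never lands below the target.
    """
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 pods running at 90% CPU against a 60% target -> grow to 6 pods
print(desired_replicas(4, 90, 60))
# 6 pods at 30% CPU against a 60% target -> shrink to 3 pods
print(desired_replicas(6, 30, 60))
```

The feedback loop is simply this calculation re-run on every metrics refresh, which is why slow metrics pipelines translate directly into slow scaling.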
Types of Autoscaling:
Kubernetes Autoscalers
Horizontal Pod Autoscaler (HPA)
Scales: Replicas
Best for: Steady workloads
Watch-outs: Reactive → delays
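A minimal HPA manifest targeting CPU utilization might look like this (the Deployment name `web` and the thresholds are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                  # workload to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add replicas above ~70% average CPU
```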
Vertical Pod Autoscaler (VPA)
Scales: CPU / RAM limits
Best for: ML & batch jobs
Watch-outs: Restarts pods
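A sketch of a VPA object in `Auto` mode, which is where the restart caveat comes from (the workload names are placeholders):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: batch-vpa              # placeholder name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: batch-worker         # placeholder workload
  updatePolicy:
    updateMode: "Auto"   # VPA evicts and recreates pods to apply new resource requests
```

Setting `updateMode: "Off"` instead turns VPA into a recommendation-only tool, avoiding restarts while you evaluate its suggestions.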
Event‑Driven (KEDA)
Scales: Any metric or event
Best for: Spiky queues, GPU jobs
Watch-outs: Needs metric adapter
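A KEDA `ScaledObject` sketch for a queue-driven worker; the queue, deployment, and env-var names are illustrative, and credentials would normally come via a `TriggerAuthentication`:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-consumer          # placeholder name
spec:
  scaleTargetRef:
    name: consumer              # Deployment to scale
  minReplicaCount: 0            # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
  - type: rabbitmq
    metadata:
      queueName: jobs           # placeholder queue
      mode: QueueLength
      value: "10"               # target messages per replica
      hostFromEnv: RABBITMQ_URL # connection string from the workload's env
```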
Cluster Autoscaler
Scales: Nodes
Best for: Infra‑level savings
Watch-outs: Slowest to react
Autoscaling Strategy

Resource-based
Scale trigger: CPU / memory utilization
Best for: Long-running, stable services
Key benefit: Built-in, zero extra setup
Main drawback: 30s+ metrics lag; coarse granularity

Custom-metrics-based
Scale trigger: Any user-defined metric
Best for: Business- or domain-specific goals
Key benefit: Fully flexible to your KPIs
Main drawback: Requires metrics pipeline & adapter

Event-driven
Scale trigger: Queue length, pub/sub
Best for: Spiky, asynchronous workloads (e.g. GPU jobs)
Key benefit: Near real-time reactions
Main drawback: Needs adapter for each event source

HTTP-based
Scale trigger: Request rate / concurrency
Best for: HTTP/gRPC APIs with unpredictable spikes
Key benefit: Real-time scaling for web traffic
Main drawback: Requires complex orchestration
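A custom-metrics-based strategy typically still uses the HPA object, pointed at a metric served through an adapter (e.g. prometheus-adapter). The metric name and target below are hypothetical, and the adapter must actually expose them:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa                 # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                   # placeholder workload
  minReplicas: 2
  maxReplicas: 30
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical metric served by an adapter
      target:
        type: AverageValue
        averageValue: "100"     # target requests/sec per pod
```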
Common Pitfalls
Cold‑starts vs. warm‑starts: latency spikes when new pods spin up.
Over‑provisioning “just in case”: burning money on idle nodes.
Slow, reactive resource scaling: CPU/memory metrics lag behind real demand, causing delayed scaling and misaligned capacity.
Thundering herds & API rate limits: bursts of workers hammer downstream services.
Observability blind spots: scaling in the dark without live metrics.
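One common mitigation for thundering herds and scaling flapping is the HPA v2 `behavior` field, which rate-limits scale-up and stabilizes scale-down. The values below are illustrative, not recommendations:

```yaml
# Fragment of an HPA spec; attach under spec: of an autoscaling/v2 HorizontalPodAutoscaler
behavior:
  scaleUp:
    policies:
    - type: Pods
      value: 4            # add at most 4 pods...
      periodSeconds: 60   # ...per 60-second window, softening bursts downstream
  scaleDown:
    stabilizationWindowSeconds: 300   # hold 5 minutes of low load before removing pods
```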