What You’ll Prove

The POC focuses on measurable autoscaling outcomes: cost, latency, utilization, and operational effort on a real workload.

GPU autoscaling

Lower idle GPU capacity

Validate GPU-aware, event-driven scaling for inference or fine-tuning while keeping P95 latency predictable.

Target: 30–40% lower GPU spend

Kubernetes autoscaling

Clusters that follow demand

Exercise predictive policies that scale Kubernetes clusters dynamically across cloud environments.

Evidence: cost and performance deltas

HTTP/gRPC endpoints

Traffic-aware application scaling

Scale services from request pressure and bursty traffic while preserving latency SLOs.

Evidence: latency and replica behavior

Built-in visibility

Operational proof, not guesswork

Use Prometheus/OTel metrics, long-term storage, dashboards, and readouts to show the impact clearly.

Output: readout and rollout plan
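
As a rough illustration of how the cost and latency deltas in a readout might be computed, here is a minimal Python sketch. It assumes you have already exported GPU spend totals and raw request latencies for a baseline window and a POC window; the function names (`p95`, `percent_change`, `readout`) and the dictionary keys are hypothetical and not part of any product API.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n gives the 1-based nearest rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def percent_change(baseline, poc):
    """Signed percent change from baseline to POC (negative means a reduction)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (poc - baseline) / baseline * 100.0


def readout(baseline, poc):
    """Summarize deltas between two measurement windows.

    Each argument is a dict with hypothetical keys:
      'gpu_spend'    - total GPU cost for the window
      'latencies_ms' - list of per-request latencies in milliseconds
    """
    return {
        "gpu_spend_change_pct": percent_change(
            baseline["gpu_spend"], poc["gpu_spend"]
        ),
        "p95_latency_change_pct": percent_change(
            p95(baseline["latencies_ms"]), p95(poc["latencies_ms"])
        ),
    }
```

Comparing both numbers side by side matters because a spend reduction only counts as a win if P95 latency stays within the agreed SLO over the same window.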