# 25: Horizontal Pod Autoscaler
## Objective
Learn how the Horizontal Pod Autoscaler (HPA) automatically scales the number of Pod replicas based on observed CPU utilization or other metrics. Deploy a CPU-intensive application, create an HPA, generate load, and observe automatic scaling behavior.
## Theory

### What is the Horizontal Pod Autoscaler?
The Horizontal Pod Autoscaler (HPA) automatically scales the number of Pods in a Deployment, ReplicaSet, or StatefulSet based on observed metrics such as CPU or memory utilization. It is one of the core autoscaling mechanisms in Kubernetes.
Key concepts:
| Concept | Description |
|---|---|
| Target metric | The metric HPA monitors (e.g., CPU utilization at 50%) |
| Min/Max replicas | Boundaries for scaling — HPA will never scale below min or above max |
| Scaling algorithm | HPA calculates the desired replica count: desiredReplicas = ceil(currentReplicas * (currentMetricValue / targetMetricValue)) |
| Stabilization window | Scale-up: no delay (immediate). Scale-down: 5-minute stabilization window. Prevents flapping |
| Metrics Server | Required component that collects resource metrics from kubelets. Installed by default in AKS |
### How HPA Works
- The HPA controller queries the Metrics Server every 15 seconds (default)
- It compares the current metric value against the target
- If the current value exceeds the target, it scales up (adds replicas)
- If the current value is below the target, it scales down (removes replicas)
- A 5-minute stabilization window prevents rapid scale-down oscillation (scale-up has no delay)
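The scaling formula can be checked with simple integer arithmetic. A minimal sketch with hypothetical values (2 replicas currently averaging 90% CPU against a 50% target):

```shell
# Hypothetical inputs: 2 replicas averaging 90% CPU, target is 50%
current_replicas=2
current_metric=90
target_metric=50

# ceil(a / b) in integer shell arithmetic is (a + b - 1) / b
desired=$(( (current_replicas * current_metric + target_metric - 1) / target_metric ))

echo "desired replicas: $desired"   # ceil(2 * 90 / 50) = ceil(3.6) = 4
```

With utilization at 90% against a 50% target, the HPA nearly doubles the replica count in a single step rather than adding one Pod at a time.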
### VPA Overview (Vertical Pod Autoscaler)
While HPA scales horizontally (more Pods), the Vertical Pod Autoscaler (VPA) scales vertically — it adjusts the CPU and memory requests and limits of individual containers.
- VPA monitors actual resource usage and recommends (or automatically applies) right-sized requests
- VPA and HPA are complementary — use HPA for scaling replicas and VPA for right-sizing individual Pods
- VPA should not be used with HPA on the same CPU/memory metric simultaneously
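As a sketch, a recommendation-only VPA for a Deployment might look like the following. This assumes the VPA components and CRDs are installed in the cluster (VPA is not part of a default cluster; in AKS it is an add-on that must be enabled explicitly), and the Deployment name is illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: hpa-demo-vpa-XX
  namespace: student-XX
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hpa-demo-XX
  updatePolicy:
    updateMode: "Off"   # recommendation-only: compute suggestions, never restart Pods
```

With `updateMode: "Off"`, the recommendations appear under the VPA object's status (`kubectl describe vpa`) and can be applied manually, which avoids any conflict with an HPA scaling on the same CPU metric.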
### Metrics Server
The Metrics Server is a cluster-wide aggregator of resource usage data. It collects CPU and memory metrics from kubelets and exposes them through the Kubernetes Metrics API.
- AKS: Metrics Server is installed by default — no additional setup required
- Verify with:

```shell
kubectl top nodes
kubectl top pods
```
### Mermaid Diagram: HPA Monitoring Loop

```mermaid
flowchart LR
    A[Metrics Server] -->|Collects CPU/memory<br>from kubelets| B[HPA Controller]
    B -->|Every 15s:<br>query current metrics| A
    B -->|Compare current<br>vs target| C{Current > Target?}
    C -->|Yes| D[Scale Up<br>Add replicas]
    C -->|No| E{Current < Target?}
    E -->|Yes| F[Scale Down<br>Remove replicas<br>after 5 min stabilization]
    E -->|No| G[No change]
    D --> H[Deployment<br>adjusts replica count]
    F --> H
    G --> B
    H --> B
```
## Practical Tasks

All tasks should be performed in your namespace `student-XX`. Replace `XX` with your student number throughout.
### Task 1: Deploy a CPU-Intensive Application
Deploy the HPA example application, which is a simple PHP-based web server that performs CPU-intensive computations on each request.
Create a file `hpa-demo.yaml`:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hpa-demo-XX
  namespace: student-XX
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hpa-demo-XX
  template:
    metadata:
      labels:
        app: hpa-demo-XX
    spec:
      containers:
      - name: app
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "200m"
          limits:
            cpu: "500m"
```
Apply it and expose it with a Service:

```shell
kubectl apply -f hpa-demo.yaml -n student-XX
kubectl expose deployment hpa-demo-XX --port=80 -n student-XX
```
Verify the Pod is running and check initial CPU usage:

```shell
kubectl get pods -l app=hpa-demo-XX -n student-XX
kubectl top pods -l app=hpa-demo-XX -n student-XX
```
Note: `kubectl top` may take a minute to start showing metrics for newly created Pods.
### Task 2: Create an HPA

Create an HPA that targets 50% average CPU utilization, with a minimum of 1 and a maximum of 5 replicas:

```shell
kubectl autoscale deployment hpa-demo-XX --cpu-percent=50 --min=1 --max=5 -n student-XX
```
Verify the HPA was created:

```shell
kubectl get hpa hpa-demo-XX -n student-XX
```
You should see output similar to:

```
NAME          REFERENCE                TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
hpa-demo-XX   Deployment/hpa-demo-XX   <unknown>/50%   1         5         1          10s
```

The `<unknown>` target is normal at first; it takes about 60 seconds for the Metrics Server to start reporting values.
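The same HPA can also be written declaratively with the `autoscaling/v2` API, which is the form you would typically keep in version control instead of running `kubectl autoscale`:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-demo-XX
  namespace: student-XX
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hpa-demo-XX
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
```

Applying this manifest with `kubectl apply -f` produces the same autoscaler as the imperative command above, and the `metrics` list is where additional metrics (memory, custom metrics) would be added.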
### Task 3: Generate Load

Open a second terminal and run a load generator that continuously sends HTTP requests to the application:

```shell
kubectl run -i --tty load-generator-XX --rm --image=busybox -n student-XX -- /bin/sh -c "while true; do wget -q -O- http://hpa-demo-XX; done"
```
This will create a temporary Pod that hammers the service with requests, driving up CPU usage.
Keep this terminal open and the load generator running for the next task.
### Task 4: Observe Scale-Up

In your first terminal, watch the HPA in real time:

```shell
kubectl get hpa hpa-demo-XX -w -n student-XX
```

In a third terminal (or another tab), watch the Pods:

```shell
kubectl get pods -l app=hpa-demo-XX -w -n student-XX
```
Within 1-2 minutes, you should observe:
- CPU utilization rising above 50%
- HPA increasing the replica count
- New Pods being created
The HPA will scale up until CPU utilization per Pod drops below the 50% target.
### Task 5: Stop Load and Observe Scale-Down

- Go back to the second terminal and press `Ctrl+C` to stop the load generator
- Continue watching the HPA and Pods in the other terminals
- After about 5 minutes (the default scale-down cooldown), HPA will begin reducing the replica count
- Eventually, replicas will return to 1
```shell
kubectl get hpa hpa-demo-XX -n student-XX
kubectl get pods -l app=hpa-demo-XX -n student-XX
```
## Cleanup

```shell
kubectl delete hpa hpa-demo-XX -n student-XX
kubectl delete svc hpa-demo-XX -n student-XX
kubectl delete deployment hpa-demo-XX -n student-XX
```
## Common Problems

| Problem | Cause | Solution |
|---|---|---|
| HPA shows `<unknown>` for targets | Metrics Server not ready or Pod just started | Wait 60 seconds, run `kubectl top pods` to verify metrics are available |
| HPA does not scale up | CPU target not exceeded | Increase load or lower the `--cpu-percent` target |
| `kubectl top` returns an error | Metrics Server not installed | In AKS, Metrics Server is installed by default. Check with `kubectl get pods -n kube-system -l k8s-app=metrics-server` |
| Scale-down takes too long | Default stabilization window is 5 minutes | This is expected behavior to prevent flapping |
| Pods have no resource requests | HPA requires resource requests to calculate utilization | Always set `resources.requests` in your Pod spec |
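If the default 5-minute scale-down window does not suit a workload, the `autoscaling/v2` API lets you tune it per HPA through the `behavior` field. A sketch (an excerpt of an HPA `spec`; the values are illustrative, not recommendations):

```yaml
# Excerpt of an autoscaling/v2 HorizontalPodAutoscaler spec
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60   # default is 300 (5 minutes)
      policies:
      - type: Percent
        value: 50            # remove at most 50% of current replicas
        periodSeconds: 60    # per 60-second period
```

Shortening the window makes the autoscaler react faster to falling load at the cost of more oscillation, so change it only when you understand the workload's traffic pattern.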
## Best Practices
- Always set resource requests — HPA calculates utilization as a percentage of the requested resources. Without requests, HPA cannot function
- Choose appropriate targets — 50-70% CPU target is a common starting point; too low wastes resources, too high risks latency
- Set reasonable min/max — The minimum should handle baseline traffic; the maximum should stay within cluster capacity
- Combine with Pod Disruption Budgets — Ensure scale-down does not remove all Pods at once
- Use custom metrics for non-CPU workloads — HPA supports custom metrics (e.g., request rate, queue depth) via the Custom Metrics API
- Consider VPA for right-sizing — Use VPA to find optimal resource requests, then use HPA for scaling
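To illustrate the Pod Disruption Budget point above, a minimal PDB for the demo app might look like the following (the name is illustrative; note that PDBs constrain voluntary disruptions such as node drains and evictions, not the HPA's own replica-count changes):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hpa-demo-pdb-XX
  namespace: student-XX
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: hpa-demo-XX
```

With `minAvailable: 1`, an eviction (for example during a node drain while the cluster scales down) is blocked whenever it would leave the application with no running Pod.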
## Summary
In this exercise you learned:
- The HPA automatically adjusts replica count based on observed metrics (CPU, memory, custom)
- HPA requires the Metrics Server and resource requests to be defined on containers
- The scaling algorithm compares current metrics against the target and adjusts replicas accordingly
- Scale-up is immediate (no delay); scale-down uses a 5-minute stabilization window to prevent oscillation
- VPA complements HPA by right-sizing individual Pod resource requests and limits