Auto-scaling for a three-node cluster: when not to use HPA
Kubernetes ships with three autoscaling mechanisms. We use one of them. Here is how we decide which knob to turn at small scale.
Most Kubernetes auto-scaling content is written for fleets of fifty nodes. We run three. The math is different — and most of what tutorials recommend is overkill for the cluster size that early-stage startups actually run.
Here is how we think about scaling on our own K3s cluster: what the three autoscalers do, and which ones we actually turn on.
The three autoscalers in the spec
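The Kubernetes project maintains three official autoscalers. Only HPA is built into the core API; VPA and Cluster Autoscaler are separate components you install yourself: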
- HPA (Horizontal Pod Autoscaler) — adjusts the number of replicas of a pod up and down based on CPU, memory, or custom metrics.
- VPA (Vertical Pod Autoscaler) — adjusts the CPU and memory requests of a pod based on observed usage.
- Cluster Autoscaler — adds and removes nodes to match the total resource demand.
All three are useful. None of them is mandatory. The cost of using each one is non-zero — they all introduce moving parts, potential for thrash, and another thing to debug when something goes wrong.
What we run, in one paragraph
On our three-node cluster, we use HPA only for two specific workloads: services with bursty, unpredictable traffic patterns. Everything else runs at a fixed replica count. We do not use VPA. We do not use Cluster Autoscaler (we picked our node count, and we will pick a different one when we scale; no thrash). The cluster has been up for months and has not needed any of the auto-scaling machinery to fire at us.
This is not a generic recommendation. It is a careful right-sizing for our actual traffic shape.
When HPA earns its place
HPA is useful when:
- Traffic genuinely varies by an order of magnitude across the day. Not “doubles at lunch” — “is silent at night, then ten engineers arrive in the morning”. Our match-service hits this shape.
- A failed scale-up does not hurt. HPA reacts to metrics it has already observed, not to predicted load, and typically needs 15-60 seconds to bring new replicas into service. If your traffic spikes faster than that, HPA will not save you.
- You can size pods small enough that scaling out is cheap. HPA works in units of pods. If a pod takes 30 seconds to come ready (Java JVMs, large model loads), HPA is the wrong tool — pre-warm a fixed replica count instead.
- You have observability that confirms the metric you are scaling on. Scaling on CPU is fine if CPU is the actual bottleneck. Scaling on CPU when the bottleneck is database connections leads to a stampede: every new replica opens more connections against a database that was already saturated.
When HPA does not earn its place:
- For most internal tools and admin surfaces. A fixed replicaCount: 2 (so a node loss does not page you) is correct. HPA on a tool five people use is theatre.
- For databases or stateful services. HPA on a Postgres pod is not what you want: those replicas need careful coordination, not arbitrary count changes.
- For services where the traffic shape is roughly flat. Adding HPA where it never fires adds nothing but operational risk.
What VPA actually costs you
VPA sounds magical: it watches your pods and adjusts their CPU/memory requests so you stop overprovisioning. In practice we do not run it. The reasons:
- In its automatic update modes, VPA evicts and recreates pods to apply new requests. For long-running services this is a rolling restart you did not ask for. For batch jobs it can derail an in-progress run.
- VPA recommendations need a human review at any scale below “fleet of hundreds”. The savings on three nodes are usually a few percent of memory — not worth the operational surprise of pods restarting on their own.
- Right-sizing once at startup is usually sufficient at small scale. We do this by hand: ship the service, observe it for a week, set the requests to a value that covers p95 with headroom. Done. VPA is solving a problem we do not have.
If we ran fifty pods of fifty different services on fifty nodes, VPA would pay back. At three nodes, it does not.
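The by-hand right-sizing described above (observe for a week, cover p95 with headroom) is simple arithmetic. The sketch below is one way to do it, not our exact tooling: the nearest-rank percentile method, the 20% headroom figure, and the sample numbers are all assumptions chosen for illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has
    at least pct% of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def recommend_request(samples, pct=95, headroom_pct=20):
    """Request value that covers the chosen percentile plus headroom.
    headroom_pct=20 is an assumed figure, not a recommendation."""
    return math.ceil(percentile(samples, pct) * (100 + headroom_pct) / 100)

# Hypothetical memory usage samples in MiB, e.g. one reading per scrape.
memory_mib = [180, 190, 185, 210, 205, 195, 240, 200, 198, 202]

print(recommend_request(memory_mib))  # 288 (p95 is 240 MiB, plus 20%)
```

With a real week of data you would feed in thousands of samples from your metrics store; the calculation stays the same.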
What Cluster Autoscaler actually costs you
Cluster Autoscaler is the one that pages people. The promise is “automatic node management”; the reality is:
- Provisioning a new node is slow. Most clouds need a minute or two. If demand spikes faster than that, your application 503s during the scale-up window.
- Removing a node is destructive. Cluster Autoscaler drains a node and reschedules its pods elsewhere. Persistent volumes on zonal storage do not move with them: the volume stays in its original availability zone, so the pod can only land on a node in that zone. PodDisruptionBudgets can block the drain entirely. Both are real footguns.
- You still have to pick a cap. If you do not, a runaway workload eats your entire cloud budget overnight.
We picked three nodes because three nodes is enough for our load. When it stops being enough, we will pick four and provision the fourth explicitly. The cost of doing this by hand is one Ansible command every few months. The benefit is no surprise autoscaler stampedes.
The decision rubric
Here is the rubric we use when a new service comes online:
| Question | If yes | If no |
|---|---|---|
| Does traffic vary by >5× across the day? | Consider HPA | Fixed replicas |
| Can pods reach ready in <15s? | HPA is viable | Pre-warm fixed count |
| Is the bottleneck CPU (not DB / IO / network)? | Scale on CPU | Scale on the actual bottleneck OR don’t |
| Will a 60s scale-up window hurt users? | Skip HPA — keep headroom | HPA is fine |
| Is the workload stateful? | No HPA | HPA is fine |
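The table can also be read as a small decision function. The sketch below is one possible encoding: the Service fields, the order in which the questions are checked, and the example values are assumptions made for illustration, not part of any Kubernetes API.

```python
from dataclasses import dataclass

@dataclass
class Service:
    peak_to_trough_ratio: float   # daily traffic variation, e.g. 10 means 10x
    seconds_to_ready: float       # time for a new pod to pass readiness
    cpu_bound: bool               # is CPU the real bottleneck?
    scale_up_window_hurts: bool   # would a 60s scale-up lag hurt users?
    stateful: bool

def scaling_decision(svc):
    """Walk the rubric, ruling out HPA as early as possible."""
    if svc.stateful:
        return "fixed replicas (stateful: no HPA)"
    if svc.peak_to_trough_ratio <= 5:
        return "fixed replicas (traffic too flat)"
    if svc.seconds_to_ready >= 15:
        return "pre-warm fixed count (slow readiness)"
    if svc.scale_up_window_hurts:
        return "fixed replicas with headroom (scale-up lag hurts)"
    if not svc.cpu_bound:
        return "HPA on the actual bottleneck metric, or fixed replicas"
    return "HPA on CPU"

# Hypothetical services; the numbers are made up.
bursty_api = Service(10, 5, True, False, False)
admin_tool = Service(1.5, 5, True, False, False)

print(scaling_decision(bursty_api))
print(scaling_decision(admin_tool))
```

Checking the disqualifying questions first mirrors the default stated in this post: a service has to earn HPA, not opt out of it.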
When the rubric points to HPA, we set minReplicas: 2 (always have redundancy), pick a sensible maxReplicas (we cap at 6 for most services), and target 70% CPU utilization. We monitor it for a week before trusting it.
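To see what those settings do in practice, it helps to run the numbers. The sketch below applies the replica formula from the Kubernetes HPA documentation, desired = ceil(current × currentMetric / targetMetric), with the documented default tolerance of 10%, and clamps the result to the bounds above. It is a simplification: the real controller also accounts for pod readiness, missing metrics, and stabilization windows, none of which is modelled here. The function and parameter names are ours.

```python
import math

def desired_replicas(current, current_util, target_util=70,
                     min_replicas=2, max_replicas=6, tolerance=0.1):
    """Simplified HPA replica calculation.

    current_util and target_util are average CPU utilization in percent,
    measured relative to each pod's CPU request.
    """
    ratio = current_util / target_util
    # Within tolerance of the target, HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))

print(desired_replicas(2, 140))  # 4: load is double the target
print(desired_replicas(4, 20))   # 2: formula says 2, which is also the floor
print(desired_replicas(3, 72))   # 3: within the 10% tolerance, no change
print(desired_replicas(4, 300))  # 6: formula says 18, capped at maxReplicas
```

The last case is the one the cap exists for: without maxReplicas, a runaway metric would ask for eighteen pods on a cluster sized for far fewer.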
The general lesson
Auto-scaling machinery is a tool. Like any tool, it is useful for specific problems and counterproductive for others. The default in our cluster is not to use it; we add it when a service has a traffic shape that genuinely benefits.
If you are running a small Kubernetes cluster — three to ten nodes, a handful of services — the chances are high that fixed replica counts plus careful right-sizing of requests will serve you better than a layer of autoscaling abstraction.
If you would like a second opinion on the scaling shape of your own cluster, we are an email away.