Mistake that cost thousands (Kubernetes, GKE)

Lessons learned scaling Kubernetes cluster

No exaggeration, unfortunately. As a disclaimer, I will add that this is a really stupid mistake and shows my lack of experience managing auto-scaling deployments. However, it all started with a question with no answer and I feel obliged to share my learnings to help others avoid similar pitfalls.

What is the difference between a Kubernetes cluster using 100x n1-standard-1 (1 vCPU) VMs VS having 1x n1-standard-96 (vCPU 96), or 6x n1-standard-16 VMs (vCPU 16)?

I asked this question multiple times in Kubernetes community. No one suggested an answer. If you are unsure about the answer, then there is something for you to learn from my experience (or skip to Answer for the impatient). Here it goes:


I woke up middle of the night with a determination to reduce our infrastructure costs.

We are running a large Kubernetes cluster. “large” is relative of course. In our case that is 600 vCPUs during normal business hours. This number goes double during peak hours and goes to near 0 during some hours of the night.

Invoice for the last month was USD 3,500.

This is already pretty darn good given the computing power that we get, and Google Kubernetes Engine (GKE) made cost management mostly easy:

Using exclusively preemptible VMs is what allows us to keeps the costs low. To illustrate the savings, in case of n1-standard-1 machine type hosted in europe-west4, the difference between dedicated and preemptible VM is USD 26.73/ month VS USD 8.03/ month. That is 3.25x lower cost. Of course, preemptible VMs have their limitations that you need to familiarise with and counteract, but that is a whole different topic.

With all of the above in place, it felt like we are doing all the right things to keep the costs low. However, I always had a nagging feeling that something is off.

Major red flag 🚩

About that nagging feeling:

Average CPU usage per Node was low (10%-20%). This didn’t seem right.

My first thought was that I have misconfigured compute resources. What resource are required depends entirely on the program that you are running. Therefore, the best thing to do is to deploy your program without resource limits, observe how your program behaves during idle/ regular and peak loads, and set requested/ limit resources based on the observed values.

I will illustrate my mistake through an example of a single deployment “admdesl”.

In our case, resource requirements are sporadic:

Pods that average 5m are “idle”: there is a task in the queue for them to process, but we are waiting for some (external) condition to clear before proceeding. In case of this particular deployment, these pods will change between idle/ active state multiple times every minute and spend 70%+ in idle state.

A minute later the same set of pods will look different:

Looking at the above I thought that it makes sense to have a configuration such as:

This translates to:

  • idle pods don’t consume more than 20m
  • active (healthy) pods peak at 200m

However, when I used this configuration, it made deployments hectic.

This would happen whenever a new Node is brought in/ out of the cluster (which happens often due to auto-scaling).

As such, I kept increasing requested pod resources until I have ended up with the following configuration for this deployment:

With this configuration the cluster was running smoothly, but it meant that even idle Pods were pre-allocated more CPU time than they need. This is the reason why the average CPU usage per Node was low. However, I didn’t know what is the solution (reducing requested resources resulted in hectic cluster state/ outages) and as such I rolled with a variation of generous resource allocation for all the deployments.


Back to my question:

What is the difference between a Kubernetes cluster using 100x n1-standard-1 (1 vCPU) VMs VS having 1x n1-standard-96 (vCPU 96), or 6x n1-standard-16 VMs (vCPU 16)?

For starters, there is no price-per-vCPU difference between n1-standard-1 and n1-standard-96. Therefore, I reasoned that using a machine with fewer vCPUs is going to give me more granular control over the price.

The other consideration I had was how fast the cluster will auto-scale, i.e. if there is a sudden surge, how fast can the cluster auto scaler provision new nodes for the unscheduled pods. This was not a concern though — our resource requirements grow and shrink gradually.

And so I went with mostly 1 vCPU nodes, the consequence of which I have described in Premise.

Retrospectively, it was an obvious mistake: distributing pods across nodes with a single vCPU does not allow efficient resource utilisation as individual deployments change between idle and active states. Put it another way, the more vCPUs you have on the same machine, the more tightly you can pack many pods because as a portion of pods go over their required quota, there are readily available resources to take.

What worked:

  • I switched to 16 vCPU machines because they provide a balanced solution between fine resource control when auto-scaling the cluster and sufficient resources per machine to enable tight scheduling of pods that are going through idle/ active states.
  • I used resource configuration that requests only marginally more than the resources that are needed during an idle state, but have generous limits. It allows to have many pods scheduled on the same machine when majority of the pods are in an idle state, but still allows resource intensive bursts.
  • I switched to n2 machine type: n2 machines are more expensive, but they have 2.8 GHz base frequency (compare with ~2.2 GHz available to n1-* machines). We are taking an advantage of a higher clock frequency to process resource intensive tasks as fast as possible and put pods into the earlier described idle state as quick as possible.

Written by

Software architect, startup adviser. Editor of https://medium.com/applaudience. Founder of https://go2cinema.com.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store