Horizontal Pod Autoscaling — An Aspect of Elasticity in Distributed Architecture

In the third and fourth quarters of 2020 I worked with a European bank to help with the go-live of its new internet and mobile banking platform on the AWS public cloud. Although the architecture of the new platform was hybrid and multi-cloud, many critical components ran as containerised microservices on an AWS EKS managed cluster. During performance and endurance testing in the pre-production environment, the bank faced serious challenges with the resiliency, reliability and performance of these components. Because the go-live was scheduled around Black Friday, the bank anticipated heavy load on some of these microservices, such as KYC and GDPR updates and payments/fund transfers. The Pods (the abstraction in which containerised microservices run) were expected to demand far more resources during this period, in terms of replica count, CPU, memory and so on, and had to scale elastically up and down to meet that demand. In this post I’ll discuss, at a high level and in simple terms, how this elastic autoscaling was applied using the Kubernetes Horizontal Pod Autoscaling feature to address that challenge.

Kubernetes is one of the most widely adopted solutions today for automating the orchestration and management of distributed applications, which run as large numbers of immutable Pods whose desired state is expressed declaratively. Still, it is not easy to define a desired state for seasonal workloads: the resource requirements of Pods, the number of replicas, the number of nodes in the cluster and so on. Both horizontal and vertical scaling of applications can be performed in Kubernetes either manually or in a fully automated fashion, given some rules. Kubernetes can monitor external load and capacity-related events, analyse the current state, and scale itself to reach the desired performance and resiliency. This brings antifragility to a distributed architecture, based on actual usage rather than anticipated factors. Let’s see how this can be achieved with Horizontal Pod Autoscaling (HPA).

Horizontal Pod Autoscaling: Because the dynamic nature of a workload makes a fixed scaling configuration difficult to get right, the most straightforward approach is to use a HorizontalPodAutoscaler (HPA) to scale the number of Pods horizontally. Please note that for HPA to work, a metrics server, which aggregates cluster-wide resource usage, has to be enabled.

How it works: The following diagram shows schematically how HPA works. To set up HPA on AWS EKS, please refer here.

When the following command is issued from the command line:

kubectl autoscale deployment <deployment name> --cpu-percent=50 --min=1 --max=5

it creates an HPA definition in which the minimum number of Pods, which should always be running, is set to 1, and the maximum number of Pods up to which HPA can scale is set to 5. It also sets the desired average CPU utilization to 50%. This means that when Pods have a CPU request of 200m, scale-up happens when more than 100m of CPU is used on average. Also note that although HPA can be applied to ReplicaSets and StatefulSets as well as Deployments, it is recommended to use it with the higher-level Deployment resource, so that the HPA carries over to the new lower-level ReplicaSets that the Deployment creates.
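The same settings can also be written declaratively instead of through the imperative command. The sketch below is only an illustration, not the configuration used at the bank: it builds a HorizontalPodAutoscaler manifest as a Python dictionary and prints it as JSON, which kubectl accepts as well as YAML. It assumes a cluster recent enough to serve the autoscaling/v2 API; the deployment name payment-service and both function names are hypothetical.

```python
import json


def build_hpa_manifest(deployment_name, min_replicas=1, max_replicas=5, cpu_percent=50):
    """Build an autoscaling/v2 HorizontalPodAutoscaler manifest as a dict.

    The HPA is given the same name as the Deployment it targets; that is a
    naming choice for this sketch, not a requirement.
    """
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": deployment_name},
        "spec": {
            # The higher-level resource whose replica count is managed.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment_name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        # Average utilization as a percentage of the CPU request.
                        "target": {
                            "type": "Utilization",
                            "averageUtilization": cpu_percent,
                        },
                    },
                }
            ],
        },
    }


def target_cpu_millicores(cpu_request_millicores, cpu_percent):
    """Average per-Pod CPU usage (in millicores) that corresponds to the target."""
    return cpu_request_millicores * cpu_percent / 100


if __name__ == "__main__":
    # "payment-service" is a hypothetical deployment name.
    print(json.dumps(build_hpa_manifest("payment-service"), indent=2))
    # A 200m request with a 50% target corresponds to 100m average usage.
    print(target_cpu_millicores(200, 50))
```

Saving the printed JSON to a file and applying it with kubectl apply -f should give an HPA with the same minimum, maximum and CPU target as the command above, with the advantage that the definition can be kept under version control.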

With reference to the above diagram, let’s try to understand in a simple way what happens behind the scenes:

1. The HPA controller retrieves metrics for the Pods that are subject to scaling from the Kubernetes Metrics API. Both aggregated metrics and Pod-level resource metrics are obtained from the Metrics API.

2. It then calculates the required number of replicas from the current and desired metric values. You may consider this simplified version of the formula:

desiredReplicas = ceil[ currentReplicas × ( currentMetricValue / desiredMetricValue ) ]


For the command example above, with the current CPU usage at 90% of the specified CPU resource request and the desired value at 50%, the number of replicas will be ceil[1 × 90/50] = ceil[1.8] = 2.

3. The replicas field of the autoscaled resource is updated with the calculated number, and the corresponding controller works towards this new desired state.
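To make step 2 concrete, here is a minimal sketch of the simplified calculation: the replica count is multiplied by the ratio of current to desired metric value, rounded up, and then kept within the configured minimum and maximum. The function name and parameters are hypothetical. The tolerance of 0.1 mirrors the default documented for the HPA controller, which skips scaling when the ratio is close enough to 1. The real controller handles much more, such as Pods that are not yet ready and missing metrics.

```python
import math


def desired_replicas(current_replicas, current_metric, desired_metric,
                     min_replicas=1, max_replicas=5, tolerance=0.1):
    """Simplified HPA replica calculation (illustrative sketch only)."""
    ratio = current_metric / desired_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        # Round up so the target is met rather than narrowly missed.
        desired = math.ceil(current_replicas * ratio)
    # Never go below the minimum or above the maximum set on the HPA.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(desired_replicas(1, 90, 50))  # the example from step 2: ceil(1.8)
    print(desired_replicas(4, 90, 50))  # ceil(7.2) is capped by max_replicas
    print(desired_replicas(4, 10, 50))  # low load scales back down
    print(desired_replicas(3, 52, 50))  # within tolerance, no change
```

With the values from the command example (minimum 1, maximum 5, target 50%), the four calls return 2, 5, 1 and 3 respectively.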

The actual implementation in Kubernetes is much more complex, as it considers different metric types, multiple running instances of Pods and so on. It is an area with many low-level details that is evolving rapidly. Like Horizontal Autoscaling, Vertical Autoscaling provides elasticity within the cluster capacity, whereas Cluster Autoscaling provides elasticity at the cluster capacity level. It is both complementary to and decoupled from the other two scaling methods. For further detailed reading, refer here and here.