Thursday, 21 September 2017

Kernel tuning in Kubernetes

Kubernetes is a great piece of technology: it trivialises things that five years ago required solid ops knowledge, and it makes DevOps attainable for the masses. More and more people are jumping on the Kubernetes bandwagon nowadays; sooner or later, however, they realise that a pod is not exactly a small VM magically managed by Kubernetes.

Kernel tuning

So you've deployed your app in Kubernetes and everything works great; now you just need to scale. What do you do? The naive answer is: add more pods. Sure, you can do that, but the pods will quickly hit kernel limits. One kernel parameter I have in mind in particular is net.core.somaxconn. It sets the maximum number of connections that can be queued for acceptance on a listening socket. The default value on Linux is 128, which is rather low:
root@x:/# sysctl -a | grep "net.core.somaxconn"
net.core.somaxconn = 128
You might get away with not increasing it, but I believe it's wasteful to create new pods unless there is an actual CPU or memory need.

In order to update a sysctl parameter on a normal Linux VM, the following command just works:
root@x:/# sysctl -w net.core.somaxconn=10000    
However, try it in a pod and you get this:
root@x:/# sysctl -w net.core.somaxconn=10000  
sysctl: setting key "net.core.somaxconn": Read-only file system
Now you start to realise that this is not your standard Linux VM where you can do whatever you want; this is Kubernetes' turf and you have to play by its rules.

Docker image baking

At this point you're probably wondering why I'm not just baking the sysctl parameters into the Docker image. The Dockerfile could be as simple as this:
FROM alpine
#Increase the number of connections
RUN echo "net.core.somaxconn=10000" >> /etc/sysctl.conf
Well, the bad news is that it doesn't work: /etc/sysctl.conf is only processed at boot time by the init system, and a container doesn't boot its own kernel, so the setting is never applied. As soon as you deploy the app and connect to the pod, the kernel parameter is still 128. Variations on the image-baking theme can be attempted, but I could not get any of them to work and I have a strong feeling it's not doable this way.

Kubernetes sysctl

There is documentation around sysctl support in Kubernetes here. So Kubernetes acknowledges that kernel tuning is sometimes required and provides explicit support for it. Then it should be as easy as following the documentation, right? Not quite: the documentation is a bit vague and it didn't quite work for me, because I'm not creating pods directly as described there, I'm using Deployments.

After a bit of research I did find a way to do it: init containers. Init containers are specialised containers that run before the app containers in a pod and can contain utilities or setup scripts not present in the app image. Let's see an example:
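The deployment manifest itself did not survive in the post, so here is a minimal sketch of what it could look like. The names (ubuntu-test, ubuntu-test-image, init-sysctl) are illustrative, and the extensions/v1beta1 API version matches the 2017-era tooling; the key parts are the initContainers section and the privileged security context, without which sysctl -w fails with the "Read-only file system" error shown earlier.

```yaml
# Hypothetical ubuntu-deployment.yml; names and image are illustrative.
apiVersion: extensions/v1beta1   # current clusters would use apps/v1
kind: Deployment
metadata:
  name: ubuntu-test
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: ubuntu-test
    spec:
      # The init container runs to completion before the app container starts.
      # It must be privileged, otherwise writing sysctls is not allowed.
      initContainers:
      - name: init-sysctl
        image: busybox
        securityContext:
          privileged: true
        command:
        - sysctl
        - -w
        - net.core.somaxconn=10000
        - net.ipv4.ip_local_port_range=1024 65535
      containers:
      - name: ubuntu-test
        image: ubuntu-test-image
        imagePullPolicy: IfNotPresent
```

Because the containers in a pod share a network namespace, the net.* values written by the init container remain in effect for the app container.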

In order for this to work you will need to create a custom Ubuntu image that never terminates. It's a bit of a hack, I know, but the point is to keep the pod running so that we can connect and verify that the sysctl changes were successfully applied. This is the Dockerfile:
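The Dockerfile was also lost from the post; a minimal sketch could look like the following. The ubuntu:16.04 base and the sleep trick are assumptions on my part; any long-running foreground command would do.

```dockerfile
# Hypothetical Dockerfile for ubuntu-test-image: an Ubuntu container
# whose only job is to stay alive so we can kubectl exec into it.
FROM ubuntu:16.04
CMD ["sleep", "infinity"]
```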

In order to test this we have to build the image first:
docker build -t ubuntu-test-image .
I am using minikube locally for testing (tip: run eval $(minikube docker-env) first so the image is built inside minikube's Docker daemon), so I'm creating the deployment like this:
kubectl apply -f ubuntu-deployment.yml
Against a real Kubernetes cluster, point kubectl at the right config explicitly, e.g. kubectl --kubeconfig <path> apply -f ubuntu-deployment.yml.
Once the deployment is created, let's log into the pod and inspect the changes. First we need to find the pod name:
bogdan@x:/$ kubectl get pods | grep ubuntu-test
ubuntu-test-1420464772-v0vr1     1/1       Running            0          11m
Once we have the pod name, we can log into the pod like this:
kubectl exec -it ubuntu-test-1420464772-v0vr1 -- /bin/bash
Then check the sysctl parameters:
:/# sysctl -a | grep "net.core.somaxconn"
net.core.somaxconn = 10000

:/# sysctl -a | grep "net.ipv4.ip_local_port_range"
net.ipv4.ip_local_port_range = 1024     65535
As you can see, the changes were applied.
