Today I came across an issue when our development team told me that the web app failed after a browser reload. My knee-jerk reaction was to blame the readiness check, without any further thought. But then I realised the web app did work in the browser once; it only failed after refreshing the page. So the readiness check couldn't be the cause of the issue. (If the readiness check's delay were shorter than the time the application needs to start, it wouldn't work in the first place.)
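For context, the readiness check I suspected is the readinessProbe on the container. A minimal sketch of what such a probe looks like; the path, port, and timings here are illustrative, not our actual values:

readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # must cover the app's startup time, or the pod never goes Ready
  periodSeconds: 5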
My first troubleshooting step was to log into the GKE console and look at the pod from the Workloads tab. I found the pod had restarted 16 times. That actually explained why the service went offline after a reload: Kubernetes kept restarting the pod to bring the service back online.
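The same restart count is also visible from the command line, without the console. Assuming the same production namespace as below:

kubectl --namespace production get pods   # restart count shows up in the RESTARTS column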
I then connected to the GKE cluster from the Cloud Shell and described the pod serving the service:
kubectl --namespace production describe pod/my-frontend-service
The output showed the pod had been killed with reason OOMKilled. OK, it's quite obvious that the pod requires more memory to stay alive.
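For reference, the tell-tale part of the describe output looks something like this (abridged; exit code 137 is 128 + SIGKILL, the standard signature of an out-of-memory kill):

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137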
The next thing I did was go to the deployment.yaml and increase the memory limit as a quick, straightforward fix. After rolling out the change, things were back to working again.
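The change itself is just the resources block in the container spec. A sketch with illustrative numbers, not our real ones:

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"   # raised so the container stops hitting the OOM killer

Then apply it and watch the rollout to confirm the new pods come up healthy:

kubectl --namespace production apply -f deployment.yaml
kubectl --namespace production rollout status deployment/my-frontend-service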
Moving forward, I will need to monitor the pod's utilisation to see how much CPU and memory it actually needs over a period of time, and then define sensible default CPU and memory requests and limits based on that.
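A quick way to spot-check current usage, assuming metrics-server is available on the cluster (it is enabled by default on GKE):

kubectl --namespace production top pod   # shows live CPU and memory usage per pod

For a proper baseline over time, though, the historical utilisation charts in the GKE console (or Cloud Monitoring) are the better source than a one-off snapshot.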