Debugging Kubernetes issues in production: a technical guide
Kubernetes is a powerful platform for orchestrating containers, but when it comes to diagnosing production issues, even the most experienced engineers can feel lost. This article guides you through best practices and kubectl commands to efficiently debug problems on a Kubernetes cluster in production.
I will execute all the commands on a live, running cluster and include screenshots of the results. This ensures that the examples are accurate and reflect real-world scenarios. You can trust that the steps provided have been tested and validated in an actual Kubernetes environment.
1. Understanding cluster health
The first step in debugging an issue is to check the overall health of the cluster. This overview helps identify failing components or resources under pressure.
a. Checking node status
Nodes are the backbone of your cluster. If a node encounters issues, the pods scheduled on it may also be affected.
# Get list of nodes
kubectl get nodes
This command lists all nodes and their statuses (Ready, NotReady, etc.). If a node is marked as NotReady, it's important to dig deeper by checking the associated events:
# Get detailed information about a node
kubectl describe node <node-name>
This command provides details on why a node might be NotReady, such as memory or disk pressure reported in the node conditions. The output also lists the node's taints, labels, and the pods running on it, which makes it a powerful command for getting detailed information about a node.
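On a cluster with many nodes, it can help to filter the list down to the nodes that need attention. The helper below is a minimal sketch of that idea: the function name not_ready_nodes is hypothetical, and it assumes the default column layout of kubectl get nodes (name first, status second).

```shell
# Hypothetical helper: print the name and status of every node whose STATUS
# column is not exactly "Ready" (this includes cordoned nodes, which show up
# as "Ready,SchedulingDisabled").
# Assumes the default column layout of `kubectl get nodes`: NAME, STATUS, ...
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1, $2 }'
}
```

Each node name it prints can then be passed to kubectl describe node for a closer look.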
b. Checking core components
Critical components like kube-apiserver, kube-scheduler, and kube-controller-manager must function correctly to ensure cluster health.
# The command below is deprecated since Kubernetes 1.19
kubectl get componentstatuses
# You can use this command instead
kubectl get --raw='/readyz?verbose'
These commands check the status of the core components. If any component is marked as Unhealthy, or any check in the /readyz output is reported as failed, it requires immediate attention.
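The verbose /readyz output can be long, so here is a small sketch that keeps only the failing checks. The function name failed_readyz_checks is hypothetical, and it assumes that failing checks are marked with "[-]" in the verbose output while passing checks are marked with "[+]".

```shell
# Hypothetical helper: show only the failing readiness checks of the API server.
# Assumption: failing checks are marked "[-]" and passing checks "[+]".
failed_readyz_checks() {
  # stderr is merged in because kubectl may report a failing endpoint there
  # (assumption). grep -F treats "[-]" as a fixed string, not a pattern.
  kubectl get --raw='/readyz?verbose' 2>&1 | grep -F '[-]'
}
```

Because grep exits with a non-zero status when nothing matches, the function returns non-zero when no failing check is found.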
2. Diagnosing pod issues
Pods are the units of execution in Kubernetes, and most production issues manifest as failing pods.
a. Get all pods that are not in the running state
This command scans the pods in all namespaces and lists those that are not in the Running state. It's a good way to quickly identify pods with a bad status.
# Get the pods in all namespaces that are not in the Running state
kubectl get pods -A | grep -v "Running"
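The grep approach above is quick, but it also lets through the header line and pods of finished jobs (Completed), and it hides pods that are Running while their containers are not ready. The sketch below tightens the filter. The function name unhealthy_pods is hypothetical, and it assumes the default columns of kubectl get pods -A (namespace, name, ready, status).

```shell
# Hypothetical helper: list pods that look unhealthy across all namespaces.
# Assumes the default columns of `kubectl get pods -A`:
#   NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  kubectl get pods -A --no-headers | awk '
    {
      # READY looks like "1/2": ready containers / total containers
      split($3, ready, "/")
      # Pods of finished jobs are expected to be Completed, so skip them
      if ($4 == "Completed") next
      # Report pods that are not Running, or Running but not fully ready
      if ($4 != "Running" || ready[1] != ready[2]) print $1 "/" $2, $3, $4
    }'
}
```

Each output line has the form namespace/pod, followed by the ready count and the status.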
b. Inspecting pod status
Instead of scanning all the namespaces, you can check the overall status of the pods in the relevant namespace:
kubectl get pods -n <namespace-name>
The STATUS, RESTARTS, and AGE columns provide an initial indication of which pods are facing issues. For additional details on a specific pod, use:
kubectl describe pod <pod-name> -n <namespace-name>
This command shows recent events, errors, and status messages.
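When the describe output is long, you may only want the events of one pod, in chronological order. Here is a minimal sketch of that; the function name pod_events and its argument order (namespace, then pod name) are my own choices.

```shell
# Hypothetical helper: show only the events attached to one pod, oldest first.
# Usage: pod_events <namespace-name> <pod-name>
pod_events() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.name=$2" \
    --sort-by=.lastTimestamp
}
```

Keep in mind that the cluster only retains events for a limited time, so an empty result does not prove that nothing happened.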
c. Exploring container logs
If a pod fails or behaves unexpectedly, container logs often provide valuable clues:
kubectl logs <pod-name> -n <namespace-name> --previous
The --previous flag is particularly useful for viewing logs from a container that crashed and was restarted.
You can also display only the last lines of your pod logs using this command:
# Get the last 200 lines of your pod logs
kubectl logs <pod-name> -n <namespace-name> --tail=200
The --tail flag limits the output to the most recent lines of the container logs. If you want to follow the logs live, add the -f option, which takes no value.
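The two flags can be combined into one small helper: try the logs of the previous (crashed) container first, and fall back to the current container if there is no previous one. This is only a sketch; the function name recent_logs and the default of 200 lines are assumptions of mine.

```shell
# Hypothetical helper: print the last lines of a pod's logs, preferring the
# previous (crashed) container instance when one exists.
# Usage: recent_logs <namespace-name> <pod-name> [line-count]
recent_logs() {
  ns="$1"; pod="$2"; lines="${3:-200}"
  # --previous fails when the container has never restarted, so fall back
  # to the logs of the current container in that case.
  kubectl logs "$pod" -n "$ns" --previous --tail="$lines" 2>/dev/null ||
    kubectl logs "$pod" -n "$ns" --tail="$lines"
}
```

For example, recent_logs <namespace-name> <pod-name> 50 shows the last 50 lines.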
d. Accessing a live container
For deeper exploration, it might be necessary to run commands inside a live container:
kubectl exec -it <pod-name> -n <namespace-name> -- /bin/bash
This opens an interactive shell, allowing you to inspect files, configuration, or run diagnostics directly in the container. If the image does not ship bash, try /bin/sh instead.
3. Resolving network issues
Network issues are common in distributed environments like Kubernetes. Here are some steps to diagnose connectivity problems.
a. Checking services and endpoints
Kubernetes services abstract pods and provide a stable network interface. If a service is not functioning as expected, start by checking if it is correctly configured:
kubectl get svc -n <namespace-name>
Next, check the associated endpoints to ensure the backend pods are correctly linked:
kubectl get endpoints <service-name> -n <namespace-name>
A service with no endpoints may indicate that the backend pods are failing or unavailable.
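Empty endpoints can also come from a selector that matches no pod. As a rough sketch, the helper below reads the selector of a service and lists the pods carrying those labels, so you can see at a glance whether anything matches. The function name pods_for_service is hypothetical.

```shell
# Hypothetical helper: list the pods matched by a service's label selector.
# Usage: pods_for_service <namespace-name> <service-name>
pods_for_service() {
  ns="$1"; svc="$2"
  # Render the selector map as "key=value,key=value," using a Go template
  selector=$(kubectl get svc "$svc" -n "$ns" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')
  selector="${selector%,}"   # drop the trailing comma
  if [ -z "$selector" ]; then
    echo "service $svc has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$ns" -l "$selector" -o wide
}
```

If this prints no pods, compare the labels on your pods with the selector in the service definition.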
b. Debugging ingress and load balancers
For applications exposed outside the cluster, ingress controllers and load balancers play a crucial role. Check their status with:
kubectl describe ingress <ingress-name> -n <namespace-name>
This command shows configuration details and associated events. Look for configuration errors, missing SSL certificates, or routing issues.
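For the missing certificate case, one possible check is to verify that every TLS secret referenced by the ingress exists in the same namespace. The helper below is a sketch of that check only: the function name check_ingress_tls is hypothetical, and it does not validate the content or expiry of the certificates.

```shell
# Hypothetical helper: report TLS secrets referenced by an ingress that do
# not exist in its namespace. Returns 1 if at least one secret is missing.
# Usage: check_ingress_tls <namespace-name> <ingress-name>
check_ingress_tls() {
  ns="$1"; ing="$2"; missing=0
  # The jsonpath expression prints the secret names separated by spaces
  for secret in $(kubectl get ingress "$ing" -n "$ns" \
      -o jsonpath='{.spec.tls[*].secretName}'); do
    if ! kubectl get secret "$secret" -n "$ns" >/dev/null 2>&1; then
      echo "missing TLS secret: $secret"
      missing=1
    fi
  done
  return "$missing"
}
```

An ingress without a TLS section produces no output and returns 0.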
c. Testing intra-cluster connectivity
To test connectivity between pods, you can use a curl or ping command from one pod to another:
kubectl exec -it <source-pod-name> -n <namespace-name> -- curl <destination-service>:<port>
If connectivity fails, it could indicate a deeper network problem, such as a firewall rule blocking traffic or a misconfigured network policy.
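Many production images do not include curl, in which case the exec approach is not available. A possible workaround, sketched below, is to start a short-lived busybox pod and probe the service from there. The function name probe_service and the pod name net-probe are hypothetical, and the sketch assumes that the service speaks HTTP and that the busybox image can be pulled by your cluster.

```shell
# Hypothetical helper: probe an HTTP service from a temporary busybox pod.
# Usage: probe_service <namespace-name> <destination-service> <port>
probe_service() {
  ns="$1"; host="$2"; port="$3"
  # --rm deletes the pod once the command exits; -T 5 is the wget timeout
  # in seconds, so a blocked connection does not hang forever.
  kubectl run net-probe --rm -i --restart=Never --image=busybox -n "$ns" \
    -- wget -q -O - -T 5 "http://$host:$port"
}
```

Running the probe in the same namespace as the source pod means that network policies scoped to that namespace apply to it as well, although policies that select pods by label may treat the probe pod differently.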
4. Using advanced debugging tools
Beyond kubectl, there are specific tools to diagnose complex issues in production.
a. kubectl debug
The kubectl debug command allows you to launch a debug container (like busybox or alpine) attached to an existing pod, giving you more flexibility to inspect the runtime environment:
kubectl debug -it <pod-name> -n <namespace-name> --image=busybox --target=<container-name>
This tool is especially useful for issues requiring root access or additional tools not present in the original container.
b. k9s
k9s is a terminal-based user interface for interacting with Kubernetes clusters. It offers a real-time view of pods, services, and other resources, making it easier to quickly identify issues.
k9s
The advantage of k9s is its ability to display logs, events, and configurations in a single interface, significantly speeding up the debugging process.
c. kail
kail is a command-line log viewer for Kubernetes. It streams the logs of your pods in real time, filtered for example by namespace.
# Stream the logs of all pods in the danielscool namespace
kail -n danielscool
Conclusion
Debugging production issues on Kubernetes requires a methodical approach and effective use of the tools at your disposal. By using the kubectl commands detailed above, you can quickly identify problems and take appropriate action. Remember, the key is to always start with a global diagnosis before diving into specific details.
Share your experiences and feel free to contribute additional tips and tools you use to debug your Kubernetes clusters in production. Happy debugging!
I share all of this and more in this book, available on Amazon: https://urls.fr/eHqOib
I’m Hervé-Gaël KOUAMO, Founder and CTO at HK-TECH, a French tech company specializing in designing, building, and optimizing applications. We also assist businesses throughout their cloud migration journey, ensuring seamless transitions and maximizing their digital potential. You can follow me on LinkedIn (I post mostly in French there): https://www.linkedin.com/in/herv%C3%A9-ga%C3%ABl-kouamo-157633197/