Debugging Kubernetes issues in production: a technical guide

Herve Khg
6 min read · Aug 18, 2024

Kubernetes is a powerful platform for orchestrating containers, but when it comes to diagnosing production issues, even the most experienced engineers can feel lost. This article guides you through best practices and kubectl commands to efficiently debug problems on a Kubernetes cluster in production.

I ran every command on a live cluster and included screenshots of the results, so the examples reflect real-world scenarios and have been validated in an actual Kubernetes environment.

1. Understanding cluster health

The first step in debugging an issue is to check the overall health of the cluster. This overview helps identify failing components or resources under pressure.

a. Checking node status

Nodes are the backbone of your cluster. If a node encounters issues, the pods scheduled on it may also be affected.

# Get list of nodes
kubectl get nodes

This command lists all nodes and their statuses (Ready, NotReady, etc.). If a node is marked as NotReady, it's important to dig deeper by checking the associated events:

Get list of nodes
# Get detailed information about a node
kubectl describe node <node-name>

This command shows why a node might be NotReady, for example memory or disk pressure. It also lists the node's conditions, taints (which pods must tolerate to be scheduled there), labels, recent events, and the pods currently running on it.

Describe a node
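On a large cluster it can help to reduce the node list to the nodes that actually need a closer look. The following is a minimal sketch, not a definitive implementation: the function name not_ready_nodes is hypothetical, and it assumes the default column layout of kubectl get nodes --no-headers (NAME first, STATUS second).

```shell
# Hypothetical helper: print the names of nodes whose STATUS does not start with "Ready".
# Reads the output of `kubectl get nodes --no-headers` on stdin.
# Assumption: column 1 is NAME and column 2 is STATUS (the default layout).
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (requires access to a cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Each name printed can then be passed to kubectl describe node. Cordoned nodes that are otherwise healthy (Ready,SchedulingDisabled) are deliberately not reported by this filter.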

b. Checking core components

Critical components like kube-apiserver, kube-scheduler, and kube-controller-manager must function correctly to ensure cluster health.

# The command below is deprecated since Kubernetes 1.19
kubectl get componentstatuses

# You can use this command instead
kubectl get --raw='/readyz?verbose'

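These commands report the health of the core components. With componentstatuses, any component marked Unhealthy requires immediate attention. With /readyz?verbose, each check is listed on its own line, and a check prefixed with [-] instead of [+] has failed.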

Get components status
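For scripts or quick checks, the readiness endpoint can also be reduced to a single yes/no answer. This is a sketch under assumptions: the function name api_ready is hypothetical, and it relies on kubectl exiting with a non-zero status when the /readyz request fails or returns an HTTP error.

```shell
# Hypothetical helper: report whether the API server passes its readiness checks.
# Assumption: kubectl exits non-zero when /readyz fails or is unreachable.
api_ready() {
  if kubectl get --raw='/readyz' >/dev/null 2>&1; then
    echo "ready"
  else
    echo "not ready"
    return 1
  fi
}

# Usage (requires access to a cluster):
#   api_ready || kubectl get --raw='/readyz?verbose'
```

When the check reports "not ready", the verbose form of the endpoint is the next place to look for the failing check.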

2. Diagnosing pod issues

Pods are the units of execution in Kubernetes, and most production issues manifest as failing pods.

a. Get all pods that are not in the running state

This command scans the pods in every namespace and lists those that are not in the Running state. It is a quick way to spot pods with a bad status. Note that the header line and pods in the Completed state also pass through the filter.

# Get the pods in all namespaces that are not in the Running state
kubectl get pods -A | grep -v "Running"
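A plain grep on "Running" misses pods that are Running but whose containers are not all ready (for example READY 1/2). The sketch below tightens the filter; the function name unhealthy_pods is hypothetical, and it assumes the default columns of kubectl get pods -A --no-headers (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: list pods that are not Running/Completed,
# or that are Running with fewer ready containers than expected.
# Reads `kubectl get pods -A --no-headers` on stdin.
# Assumption: columns are NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '{
    split($3, ready, "/")          # READY column, e.g. "1/2"
    if ($4 == "Completed") next    # finished jobs are not a problem
    if ($4 != "Running" || ready[1] != ready[2])
      print $1 "/" $2 " " $4 " " $3
  }'
}

# Usage (requires access to a cluster):
#   kubectl get pods -A --no-headers | unhealthy_pods
```

Each output line has the form namespace/pod STATUS READY, which gives the arguments needed for a follow-up kubectl describe pod.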

b. Inspecting pod status

Instead of scanning every namespace, you can check the overall status of the pods in the relevant namespace:

kubectl get pods -n <namespace-name>
Get pod status in gisalind namespace

The STATUS, RESTARTS, and AGE columns provide an initial indication of which pods are facing issues. For additional details on a specific pod, use:

kubectl describe pod <pod-name> -n <namespace-name>

This command shows recent events, errors, and status messages.
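When the describe output is long, the events for a single pod can also be queried directly. This is a minimal sketch: the function name pod_events is hypothetical, and it assumes the events still exist on the cluster (Kubernetes only keeps them for a limited time).

```shell
# Hypothetical helper: show the events recorded for one pod, oldest first.
# $1 = pod name, $2 = namespace
pod_events() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}

# Usage (requires access to a cluster):
#   pod_events <pod-name> <namespace-name>
```

Sorting by timestamp puts the most recent event at the bottom, which is usually the one that explains the current status.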

c. Exploring container logs

If a pod fails or behaves unexpectedly, container logs often provide valuable clues:

kubectl logs <pod-name> -n <namespace-name> --previous

The --previous flag is particularly useful for viewing logs from a container that crashed and was restarted.

You can also print only the last N lines of a pod's logs with this command:

# Get the last 200 lines of the pod's logs
kubectl logs <pod-name> -n <namespace-name> --tail=200
List the last 200 log lines of danielscool-backend

The --tail flag limits the output to the most recent N lines of the container's logs. To stream the logs live, use the -f (--follow) option, which takes no number; the two flags can also be combined.
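The log flags can be combined with a simple filter to surface likely errors from a recent time window. The sketch below is one possible approach, not a definitive one: the function name recent_errors and the keyword list are assumptions, and the pattern should be adapted to your application's log format.

```shell
# Hypothetical helper: print recent log lines that look like errors.
# $1 = pod name, $2 = namespace, $3 = time window (optional, defaults to 1h)
# Assumption: errors contain "error", "exception" or "fatal" (case-insensitive).
recent_errors() {
  kubectl logs "$1" -n "$2" --since="${3:-1h}" --all-containers=true --timestamps \
    | grep -iE 'error|exception|fatal' || true
}

# Usage (requires access to a cluster):
#   recent_errors <pod-name> <namespace-name> 30m
```

The --since flag bounds the search by time rather than by line count, which is often more convenient during an incident with a known start time.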

d. Accessing a live container

For deeper exploration, it might be necessary to run commands inside a live container:

kubectl exec -it <pod-name> -n <namespace-name> -- /bin/bash

This opens an interactive shell, allowing you to inspect files, check the configuration, or run diagnostics directly in the container. If the image does not ship bash, try /bin/sh instead.

Connecting to MySQL container

3. Resolving network issues

Network issues are common in distributed environments like Kubernetes. Here are some steps to diagnose connectivity problems.

a. Checking services and endpoints

Kubernetes services abstract pods and provide a stable network interface. If a service is not functioning as expected, start by checking if it is correctly configured:

kubectl get svc -n <namespace-name>
Get service in namespace danielscool

Next, check the associated endpoints to ensure the backend pods are correctly linked:

kubectl get endpoints <service-name> -n <namespace-name>

A service with no endpoints may indicate that the backend pods are failing or unavailable.

Get associated endpoints to Danielscool svc
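To check every service in a namespace at once, the endpoints list can be filtered for entries with no backend. This is a minimal sketch: the function name services_without_endpoints is hypothetical, and it assumes the default columns of kubectl get endpoints --no-headers (NAME, ENDPOINTS, AGE), where an empty list is shown as <none>.

```shell
# Hypothetical helper: print the services that have no endpoints.
# Reads `kubectl get endpoints -n <namespace-name> --no-headers` on stdin.
# Assumption: column 2 is ENDPOINTS and shows "<none>" when empty.
services_without_endpoints() {
  awk '$2 == "<none>" { print $1 }'
}

# Usage (requires access to a cluster):
#   kubectl get endpoints -n <namespace-name> --no-headers | services_without_endpoints
```

For each service reported, a common next step is to compare the service's selector with the pod labels, for example with kubectl get pods -n <namespace-name> --show-labels.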

b. Debugging ingress and load balancers

For applications exposed outside the cluster, ingress controllers and load balancers play a crucial role. Check their status with:

kubectl describe ingress <ingress-name> -n <namespace-name>

This command shows configuration details and associated events. Look for configuration errors, missing SSL certificates, or routing issues.

c. Testing intra-cluster connectivity

To test connectivity between pods, you can use a curl or ping command from one pod to another:

kubectl exec -it <source-pod-name> -n <namespace-name> -- curl <destination-service>:<port>

If connectivity fails, it could indicate a deeper network problem, such as a firewall rule blocking traffic or a misconfigured network policy.
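A curl without timeouts can hang for a long time when traffic is silently dropped. The sketch below wraps the connectivity test with explicit timeouts and prints only the HTTP status code; the function name check_http is hypothetical, and it assumes curl is installed in the source pod's image.

```shell
# Hypothetical helper: print the HTTP status code returned by a service,
# as seen from inside a source pod.
# $1 = source pod, $2 = namespace, $3 = URL (e.g. http://<destination-service>:<port>/)
# Assumption: curl is available in the source container.
check_http() {
  kubectl exec "$1" -n "$2" -- \
    curl -sS -o /dev/null -w '%{http_code}' --connect-timeout 3 --max-time 5 "$3"
}

# Usage (requires access to a cluster):
#   check_http <source-pod-name> <namespace-name> http://<destination-service>:<port>/
```

A printed status such as 200 or 404 shows that the network path works and the problem is at the application level, while a timeout or connection error points to DNS, a network policy, or a firewall rule.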

4. Using advanced debugging tools

Beyond kubectl, there are specific tools to diagnose complex issues in production.

a. kubectl debug

The kubectl debug command launches an ephemeral debug container (using an image such as busybox or alpine) inside an existing pod, giving you more flexibility to inspect the runtime environment. The --target flag places the debug container in the process namespace of the named container, and -it attaches an interactive terminal:

kubectl debug -it <pod-name> -n <namespace-name> --image=busybox --target=<container-name>

This command is especially useful for issues requiring root access or additional tools not present in the original container.

b. k9s

k9s is a terminal-based user interface for interacting with Kubernetes clusters. It offers a real-time view of pods, services, and other resources, making it easier to quickly identify issues.

k9s

The advantage of k9s is its ability to display logs, events, and configurations in a single interface, significantly speeding up the debugging process.

A screenshot of k9s command

c. kail

kail is a command-line log viewer for Kubernetes. It streams, in real time, the logs of all the pods that match a selector, such as a namespace.

kail -n danielscool
kail command in namespace danielscool

Conclusion

Debugging production issues on Kubernetes requires a methodical approach and effective use of the tools at your disposal. By using the kubectl commands detailed above, you can quickly identify problems and take appropriate action. Remember, the key is to always start with a global diagnosis before diving into specific details.

Share your experiences and feel free to contribute additional tips and tools you use to debug your Kubernetes clusters in production. Happy debugging!

I’m Hervé-Gaël KOUAMO, Founder and CTO at HK-TECH, a French tech company specializing in designing, building, and optimizing applications. We also assist businesses throughout their cloud migration journey, ensuring seamless transitions and maximizing their digital potential. You can follow me on LinkedIn (I post mostly in French there): https://www.linkedin.com/in/herv%C3%A9-ga%C3%ABl-kouamo-157633197/
