If you’ve worked with containers in a production environment, you’ve probably come across (or developed an intimate relationship with) the open source container orchestration platform Kubernetes. Want to efficiently run, orchestrate, and scale containerized applications? Kubernetes. K8s (the abbreviation for Kubernetes) can restart failed containers, load balance traffic, scale horizontally, and more. In short, Kubernetes ensures the resiliency, scalability, and failover of your containerized applications. However, even the most resilient systems fail sometimes, and K8s is no exception. When failures happen, they can have huge impacts on customers and your business. In this post, let’s look at the common issues that can occur in K8s so we can detect and resolve them quickly.
Node issues, pod failures, and container failures (often seen as restart loops) are most commonly the result of resource limitations. Properly setting resource limits and requests requires finding the Goldilocks zone: if resources are set too low, applications can crash with out-of-memory errors; if they’re set too high, you waste resources and drive up costs; and not setting limits at all can lead to overprovisioning and applications running wild. While the #1 Kubernetes problem is almost always resources, K8s clusters and their components can fail for a number of reasons.
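As a sketch of what that Goldilocks zone looks like in practice, here’s a minimal Deployment manifest with requests and limits set per container. The names, image, and values below are illustrative placeholders, not recommendations:

```yaml
# Hypothetical Deployment showing per-container requests and limits
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web               # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25        # placeholder image
          resources:
            requests:              # what the scheduler reserves for the pod
              cpu: "250m"
              memory: "256Mi"
            limits:                # hard caps; exceeding the memory limit gets the container OOMKilled
              cpu: "500m"
              memory: "512Mi"
```

Requests drive scheduling decisions, while limits cap runtime usage; one common way to avoid surprise OOM kills is to set the memory request equal to the memory limit.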
If proper limits and requests aren’t configured, nodes themselves can experience pressure on resources including memory, disk, and process IDs (PIDs). If resources can’t be reclaimed, node status errors like NotReady can pop up, and the unhealthy node won’t be able to accept pods.
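To check whether a node is under pressure, kubectl can show node status and conditions directly (the node name below is a placeholder):

```shell
# List nodes and their status; look for NotReady in the STATUS column
kubectl get nodes

# Inspect a specific node's conditions: MemoryPressure, DiskPressure,
# and PIDPressure should all report False on a healthy node
kubectl describe node <node-name>

# Show current CPU/memory usage per node (requires metrics-server)
kubectl top nodes
```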
Similarly, if a pod can’t successfully be scheduled onto a node because of resource misconfiguration or exhaustion, it can get stuck in the Pending phase or return a status of Failed or Unknown. To get a pod out of the Pending state, a node with sufficient resources must exist to schedule the pod onto; until those resources are available, the pod will remain Pending.
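For example, a pod that requests more memory than any node can offer will sit in Pending, and kubectl describe pod will show a FailedScheduling event citing insufficient memory. A contrived sketch, with a deliberately oversized request and placeholder names:

```yaml
# Hypothetical pod that no typical node can schedule
apiVersion: v1
kind: Pod
metadata:
  name: greedy-pod        # placeholder name
spec:
  containers:
    - name: app
      image: nginx:1.25   # placeholder image
      resources:
        requests:
          memory: "512Gi"  # deliberately larger than any node in a typical cluster
```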
Sometimes containers just don’t behave as expected thanks to issues in the build or CI process. Containers can also hit similar resource limitations and get stuck in a Waiting state or repeatedly fail. Misconfigured resource limits, application bugs, dependency failures, failed health checks, or network issues can lead to container restart loops and errors like CrashLoopBackOff, RunContainerError, OOMKilled, and ImagePullBackOff.
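When a container is stuck in one of these states, a few kubectl commands narrow down the cause quickly (the pod name below is a placeholder):

```shell
# Show pod status; the STATUS column surfaces CrashLoopBackOff,
# ImagePullBackOff, and similar errors
kubectl get pods

# Events at the bottom of the output often name the exact failure
# (failed probes, OOMKilled, image pull errors)
kubectl describe pod <pod-name>

# Logs from the previous (crashed) container instance, not the current restart
kubectl logs <pod-name> --previous
```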
No matter where errors manifest in a Kubernetes cluster, they’re typically the result of resource misconfiguration or exhaustion, application bugs, dependency failures, failed health checks, or network issues.
When any of these problems occur, the application is no longer running as designed. This can create increased load on the remaining components and cause applications to fail.
Ultimately, if Kubernetes clusters fail, users are negatively impacted, and that’s not great. In fact, a negative user experience is the exact opposite of what we and our businesses are trying to provide.
To discover and troubleshoot these common Kubernetes problems, it’s easy to run a command like kubectl describe pods to get detailed information about a pod and its containers. However, command line output can be overly detailed, hard to parse, and provides only point-in-time snapshots.
Reading screens full of text for all your various pods is a lot to handle when seconds matter during an outage.
Command line tools can also require on-the-fly recall of distinctions like a ConfigMap versus a regular config, or a daemon versus a DaemonSet (and what about a StatefulSet?). And the Kubernetes Dashboard just doesn’t scale or help uncover the root cause of problems in a production environment.
So now we know the common problems we can come across in our Kubernetes environments and the scale of their impacts, but how can we avoid or resolve these issues? We can build perfect software, or we can anticipate resource limitations and pinpoint failures fast through observability. Observability it is!
For a solution with comprehensive monitoring, logging, and alerting capabilities, along with powerful analytics and visualization tools, we’ll next dig into these common Kubernetes issues using Splunk Observability Cloud. To dive in right along with us, start your 14-day free Splunk Observability Cloud trial now.