
Detect and Resolve Issues in a Kubernetes Environment

CaitlinHalla
Splunk Employee

We’ve gone through common problems one can encounter in a Kubernetes environment, their impacts, and the importance of fast resolution. Now we’ll dig into a scalable observability solution that provides an easy-to-digest overview of our K8s architecture and highlights issues in real time, allowing us to act fast and mitigate impact.

Splunk Observability Cloud

To resolve performance issues faster and increase the reliability of Kubernetes environments, Splunk Infrastructure Monitoring’s Kubernetes Navigator provides real-time insight into your K8s architecture and performance. With the Kubernetes Navigator, we can detect, triage, and resolve cluster issues quickly and easily and have fun while doin’ it. 

You can always enter the Kubernetes Navigator from related content links throughout Splunk Observability Cloud – like if you’re diagnosing an issue in APM and want to check whether your cluster health is causing the problem. But today, we’re going to start fresh and go directly into the Kubernetes Navigator.

Jumping into the Kubernetes filter in Splunk Infrastructure Monitoring, we can see our two navigators: one for Kubernetes nodes and one for Kubernetes workloads. The Kubernetes workloads navigator provides insight into workloads or applications running on K8s. The Kubernetes nodes navigator provides an overview of the performance of clusters, nodes, pods, and containers. Since our current focus is on cluster health, we’ll look into the Kubernetes nodes navigator.

[Screenshot: Kubernetes Navigator start view]

From the Kubernetes nodes navigator, we get an overview of our clusters and their statuses and node dependencies. If we scroll down, we’ll see out-of-the-box charts that provide fast insight into those common issues like resource pressure and node status.

[Screenshot: Kubernetes nodes navigator overview]

Additional magic happens in the K8s analyzer. Here we’ll see an overview of pretty much all of those common problems we mentioned in our last post – nodes with memory pressure, high CPU, containers that are restarting too frequently, and abnormal pod and node statuses. As with any Navigator, we can filter this data to examine specific clusters and scope our overview. 

[Screenshot: Kubernetes analyzer]

We can dig into a specific node in the cluster by clicking on it in the heat map view of the cluster, by applying a filter, or by selecting a namespace of interest from one of the analyzer tables. Once we’re viewing the specific node, we get insight into the same helpful health info, and we can quickly diagnose node status, resource pressure, pod status, and container health. 

Hovering over one of the nodes in this scoped cluster, we can see it’s experiencing some memory pressure (you may have also noticed this called out in the K8s analyzer).

[Screenshot: cluster heat map highlighting a node with memory pressure]

When we click into it, right off the bat, we can see this node is Not Ready.

[Screenshot: node detail showing a Not Ready status]
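For the curious, the conditions the navigator surfaces here (Ready, MemoryPressure, and so on) are the same node conditions the Kubernetes API reports. As a point of comparison only, here’s a minimal sketch that pulls them with the official Kubernetes Python client, assuming the client is installed and a kubeconfig for the cluster is available:

```python
# Sketch: list nodes that are not Ready or are under memory pressure.
# Assumes the `kubernetes` Python client is installed and a kubeconfig
# for this cluster is available locally.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    for cond in node.status.conditions or []:
        not_ready = cond.type == "Ready" and cond.status != "True"
        pressured = cond.type == "MemoryPressure" and cond.status == "True"
        if not_ready or pressured:
            print(f"{node.metadata.name}: {cond.type}={cond.status} ({cond.reason})")
```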

Scrolling down through the metrics for this node, it looks like there’s a pod consuming significantly more memory than the others. If we click on that chart, we’ll get this data view:

[Screenshot: pod memory usage data view]

This node condition chart is outlined in red, which means it’s already linked to a detector and has an active alert firing. Our team was probably already paged for this alert before our casual exploration even started, meaning resolution is already underway. If we select the alert at the top right of the screen, we can view the open alert and expand its details to explore it even further. We can also jump into APM from here and see the effect this infrastructure issue is having on our app.

You might have noticed this alert is tagged with Autodetect. What’s that all about? This type of alert is a Splunk AutoDetect detector. AutoDetect detectors are automatically created in Splunk Observability Cloud to quickly discover common and high-impact anomalies in your Kubernetes infrastructure. No manual creation of custom detectors required (although you can totally do that if you want to).

Out-of-the-box Kubernetes AutoDetect detectors include:

  • Cluster DaemonSet ready versus scheduled
  • Cluster Deployment is not at spec
  • Container Restart Count > 0
  • Node memory utilization is high
  • Nodes are not ready

Conveniently, these AutoDetect detectors alert on all the primary Kubernetes problems discussed in our previous post. So rather than constantly running commands or popping into your Kubernetes Dashboard, you’ll proactively get alerted to issues with your cluster for faster diagnosis and resolution. 
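To make that concrete, the check below is roughly the manual equivalent of the Container Restart Count > 0 detector: a quick pass over restart counts with the Kubernetes Python client (same assumptions as before, the client installed and a local kubeconfig). The detector watches for this condition continuously so you don’t have to:

```python
# Sketch: the kind of manual check the "Container Restart Count > 0"
# detector replaces. Assumes the `kubernetes` Python client and a
# local kubeconfig; the output format is just illustrative.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    for status in pod.status.container_statuses or []:
        if status.restart_count > 0:
            print(
                f"{pod.metadata.namespace}/{pod.metadata.name} "
                f"container {status.name}: {status.restart_count} restarts"
            )
```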

We quickly found our culprit in one of our K8s nodes, allocated some additional memory, and our node and application recovered to a healthy state. But if we needed to, we could also dig into pods and containers, either by clicking into them or applying a filter. From the pod and container views, we can see things like resource usage per pod and number of active containers.

[Screenshot: pod and container views]

By exploring the pods and containers in these views, we can compare real-time usage to limits (and even uncover where limits aren’t set) and ensure resources are properly allocated to prevent possible pressure.
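If you ever want to audit that last point outside the UI, here’s a small sketch along the same lines that flags containers missing memory requests or limits, again using the Kubernetes Python client under the same assumptions; the output format is purely illustrative:

```python
# Sketch: flag containers with no memory request or limit set, the same
# gap the pod and container views can surface. Assumes the `kubernetes`
# Python client and a local kubeconfig.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    for container in pod.spec.containers:
        resources = container.resources
        limits = (resources.limits if resources else None) or {}
        requests = (resources.requests if resources else None) or {}
        if "memory" not in limits or "memory" not in requests:
            print(
                f"{pod.metadata.namespace}/{pod.metadata.name} "
                f"container {container.name}: memory request/limit not set"
            )
```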

Wrap up 

If you don’t yet have Splunk Observability Cloud hooked up to your Kubernetes environment, and you want to explore and ensure the health of your services, we got you! Setting up the Kubernetes Navigator is as quick and easy as starting a Splunk Observability Cloud 14-day free trial, installing the OpenTelemetry Collector, and watching your data flow in. Spend minutes to get up and running; spend an application lifetime ensuring the health and resiliency of your Kubernetes environment.

Resources

To further explore the concepts we discussed, check out the following super cool resources:
