Splunk Tech Talks
Deep-dives for technical practitioners.

Observability Unlocked: Kubernetes Monitoring with Splunk Observability Cloud

DayaSCanales
Splunk Employee


Ready to master Kubernetes and cloud monitoring like the pros?

Join Splunk’s Growth Engineering team for an exclusive deep dive into Splunk Infrastructure Monitoring (IM).

Learn how they monitor 70+ tech stacks, detect issues within minutes, and resolve them 90% faster, all while optimizing and scaling seamlessly.

 

Watch the Replay:

 

💡Key insights include:

  • OpenTelemetry (OTel) Integration
  • Faster Incident Detection & Resolution
  • Infrastructure Optimization & Scalability

Don’t miss out on the chance to level up your monitoring game.

DayaSCanales
Splunk Employee

Here are a few top-of-mind questions from the live Tech Talk:

 

Q. Can the Splunk OpenTelemetry Collector run without an initContainer (like a rhel/ubi image) to patch the application log directories/files (for file permissions) without root privileges?

A. Yes, the Splunk OpenTelemetry Collector can run without an initContainer and without root privileges to handle log file permissions. This is achieved by configuring a Kubernetes securityContext (e.g., fsGroup) for shared volumes, or by ensuring the application writes logs with permissions that allow the non-root collector user to read them. The Collector is designed to run as a non-root user for most use cases.
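
As a minimal sketch of the fsGroup approach (the pod name, image tags, and UID/GID values below are illustrative placeholders, not Splunk Helm chart defaults):

  apiVersion: v1
  kind: Pod
  metadata:
    name: app-with-otel-sidecar        # hypothetical example name
  spec:
    securityContext:
      fsGroup: 10001                   # shared volumes become group-owned by this GID
    containers:
      - name: app
        image: my-app:latest           # placeholder application image
        volumeMounts:
          - name: app-logs
            mountPath: /var/log/app
      - name: otel-collector
        image: quay.io/signalfx/splunk-otel-collector:latest
        securityContext:
          runAsNonRoot: true
          runAsUser: 10001             # any non-root UID; fsGroup grants group access to the volume
        volumeMounts:
          - name: app-logs
            mountPath: /var/log/app
            readOnly: true
    volumes:
      - name: app-logs
        emptyDir: {}

With fsGroup set, Kubernetes adjusts group ownership of the shared volume, so a filelog-style receiver in the collector container can tail the application logs without any initContainer chmod step.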


Q. What agentic AI use cases can be achieved with O11y Cloud, and what agentic AI capabilities are present in O11y Cloud?

A. Splunk and Cisco are seeing the need for the observability world to evolve beyond observability to agentic observability, and when we think about agentic observability, there are three components:

Using AI to work smarter and faster 

  • ITSI EventIQ (correlates alerts with AI to cut noise) is GA now
  • At .conf we also announced more agentic features that will be in Alpha by the end of the year: AI Troubleshooting in O11y Cloud and AppDynamics, plus ITSI AI-directed episode summarization, which is in Alpha today

Observe AI agents and infrastructure 

  • AI Agent monitoring in observability cloud 
  • AI Infrastructure monitoring (GA) 
  • AI Agent monitoring with AppDynamics (GA) 

Unify observability and show business impact 

  • Digital Experience Analytics in Observability Cloud, to gain visibility into user behavior and usage
  • APM support for hybrid apps and business transactions, to monitor business transactions with precision
  • Database monitoring, to accelerate query troubleshooting with deep insights

All of this just scratches the surface of what our teams are doing to meet the agentic moment.


Q. What's the key difference between Splunk Observability Cloud and Splunk AppDynamics?

A. Splunk Observability Cloud is designed for modern, cloud-native, and microservices environments, offering a unified platform for metrics, traces, and logs with real-time, OpenTelemetry-native data. Splunk AppDynamics excels in deep Application Performance Monitoring (APM) for traditional, monolithic, and hybrid enterprise applications, providing extensive code-level and business transaction visibility. 

Let's quickly go over our Observability Portfolio and how it all comes together: 

The foundation of this portfolio is the Splunk Platform, which includes Splunk Enterprise and Splunk Cloud and provides the underlying layer for log analytics, data management, and identity.

At the application, infrastructure, and user experience layer, we have AppDynamics, which is optimized for 3-tier and n-tier applications and self-managed environments. Observability Cloud, on the other hand, is optimized for cloud-native environments and caters to customers that have a strict requirement for OpenTelemetry.

ITSI ties these components together with an overall intelligence layer targeted at IT Ops teams. It brings business service monitoring and event analytics, coupled with powerful AI/ML capabilities. 


Q. Is it mandatory to send metrics and traces collected by the OpenTelemetry Collector to Splunk Observability Cloud while logs go to Splunk Cloud?
Is there flexibility in routing all telemetry types to a single destination?
Are there any technical limitations, and how can Log Observer be utilized for troubleshooting?

A. No, the OpenTelemetry Collector offers flexible routing; you can send all telemetry (metrics, traces, logs) to a single destination like Splunk Observability Cloud. If logs go to Splunk Cloud, Log Observer Connect unifies them in Observability Cloud. Log Observer aids troubleshooting by providing contextual log analysis integrated with metrics and traces.
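
For illustration, here is a minimal sketch of a Collector configuration that splits telemetry this way. It assumes the signalfx, sapm, and splunk_hec exporters from the Splunk distribution of the Collector; the realm, endpoints, tokens, and index are placeholders, and receiver/processor definitions are omitted for brevity:

  exporters:
    signalfx:                          # metrics to Splunk Observability Cloud
      access_token: ${SPLUNK_ACCESS_TOKEN}
      realm: us0                       # placeholder realm
    sapm:                              # traces to Splunk Observability Cloud
      access_token: ${SPLUNK_ACCESS_TOKEN}
      endpoint: https://ingest.us0.signalfx.com/v2/trace
    splunk_hec:                        # logs to Splunk Cloud via HEC
      token: ${SPLUNK_HEC_TOKEN}
      endpoint: https://http-inputs-example.splunkcloud.com/services/collector
      index: k8s_logs                  # placeholder index

  service:
    pipelines:
      metrics:
        receivers: [otlp]
        exporters: [signalfx]
      traces:
        receivers: [otlp]
        exporters: [sapm]
      logs:
        receivers: [filelog]
        exporters: [splunk_hec]        # point at an Observability Cloud exporter instead to consolidate

Routing everything to one destination is just a matter of pointing all three pipelines at the same exporter; with the split shown here, Log Observer Connect surfaces the Splunk Cloud logs alongside metrics and traces in Observability Cloud.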


Q. For an organization just starting their journey with Kubernetes monitoring, what would be the absolute first practical step you recommend they take, and how quickly can they expect to see value from implementing a solution like Splunk IM?

A. The first step we recommend organizations take is to choose relevant metrics. Not all data is equally useful. Focus specifically on system and application metrics because they directly reflect your system’s health and performance. System metrics like CPU and memory utilization, disk I/O, and network traffic provide a baseline view of cluster health. Application-specific metrics, such as node, pod, and container availability and resource usage, provide insights into performance.

So, align these metrics with your business objectives and define collection rates and retention periods for efficient data management. It is also important to have the necessary Splunk products and capabilities in place, such as Splunk Cloud and Log Observer Connect. These dependencies let you bring Kubernetes events and logs right into the Kubernetes navigators. That way, once you install the Collector for Kubernetes, you can begin monitoring Kubernetes using the navigators. You can find a list of best practices in this blog and instructions on how to install the Collector for Kubernetes in our documentation.
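
As a rough sketch of that install step, here is an illustrative values.yaml for the splunk-otel-collector Helm chart. The cluster name, realm, tokens, endpoint, and index are placeholders, and exact keys may vary by chart version:

  # values.yaml (illustrative) for the splunk-otel-collector Helm chart
  clusterName: demo-cluster            # placeholder cluster name
  splunkObservability:
    realm: us0                         # placeholder realm
    accessToken: xxxxxxxx              # placeholder Observability Cloud access token
    logsEnabled: false                 # example: route logs to the Splunk Platform instead
  splunkPlatform:
    endpoint: https://http-inputs-example.splunkcloud.com/services/collector
    token: yyyyyyyy                    # placeholder HEC token
    index: k8s_logs                    # placeholder index for Kubernetes logs and events

Applying this with helm install deploys the Collector for Kubernetes, after which the Kubernetes navigators begin populating and events and logs flow to Splunk Cloud for Log Observer Connect.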


Q. Beyond the technical capabilities, what's one piece of advice you'd give to teams looking to foster a more 'observability-driven' culture within their organization when dealing with Kubernetes and cloud infrastructure?

A. Start by making observability a shared responsibility — not just a toolset owned by one team. The biggest pitfall I see is partial buy-in: when even one team isn’t contributing to telemetry, you create data blind spots that break the full picture. The best organizations treat data as a shared language that connects Dev, Sec, and Ops around a common view of service health and customer experience. Encourage teams to collaborate early, automate wherever possible, and treat every incident or deployment as a learning opportunity. That’s when observability shifts from something you monitor to something you practice.


Q. You highlighted the importance of 'Observability Unlocked.' Could you elaborate on one common misconception organizations have about observability in a Kubernetes and cloud environment, and how Splunk IM specifically helps to correct that?

A. One common misconception organizations have about observability in Kubernetes and cloud environments is that simply collecting large volumes of metrics, logs, and traces independently is sufficient to understand and resolve issues. Many believe that more data alone guarantees better visibility and faster problem resolution. However, a strategy of simply collecting more data can produce redundant or irrelevant data and still leave visibility fragmented.

You need a data management strategy that collects the most useful data across all your different tech stacks, teams, and services. Splunk IM provides unified visibility across any environment and stack, combining key metrics, logs, and traces into a single, correlated observability experience. You can learn more about challenges, best practices, and key metrics and tools in this blog.


Q. Can you share a specific, perhaps challenging, use case where Splunk IM's capabilities significantly reduced MTTR (Mean Time To Resolution) or helped a team correlate their SLOs to business impact?

A. Here are two specific IM customer stories on the website, about Bosch Rexroth and a healthcare software company. Bosch Rexroth was able to tackle peak energy pricing by having infrastructure monitoring across both IT and OT environments. This meant the factory could avoid surcharges for exceeding peak energy consumption levels. With Splunk, the team at Bosch Rexroth AG set up an alert, based on an AI/ML energy consumption forecasting model, to warn operators when the factory reaches the 90% mark for energy consumption. This prevents the factory from exceeding the maximum and paying energy price surcharges.

Yes, there have been multiple incidents, but if I had to pick one, it would be during .conf25. Our application was stable, and we anticipated a modest 20% increase in traffic. However, the actual surge was a staggering 100%, overwhelming our infrastructure. This unexpected spike led to significant CPU and memory utilization, threatening our service-level objectives (SLOs). Fortunately, our Splunk Infrastructure Monitoring (IM) detector promptly alerted us to the anomaly. This early warning enabled us to swiftly add more nodes to the cluster, restoring balance. By leveraging Splunk IM's real-time alerts, we maintained our SLOs and ensured uninterrupted service during the critical event.


 
