
From GPU to Application: Monitoring Cisco AI Infrastructure with Splunk Observability Cloud

AqibKazi
Splunk Employee

AI workloads are different. They demand specialized infrastructure—powerful GPUs, enterprise-grade networking, and orchestration platforms that can handle the scale. Organizations investing in AI are turning to Cisco AI-Ready PODs for exactly this reason: purpose-built infrastructure validated for AI workloads, from model training to inference at scale.

But having powerful infrastructure is only half the equation. When your AI application slows down or fails, you need complete visibility across every layer—from individual GPU utilization all the way up to application-level performance. That's where Splunk Observability Cloud comes in. Together, Splunk and Cisco deliver unified observability for AI infrastructure, giving you the insights you need to keep AI workloads running at peak performance.

The Monitoring Challenge for AI Infrastructure

Here's the problem most organizations face: their monitoring tools weren't built for AI workloads. Infrastructure teams have dashboards for servers and networking. Platform teams monitor Kubernetes. Application teams track service performance. But nobody has a unified view that connects GPU utilization to application latency.

This fragmentation creates real problems. When users complain about slow AI responses, where do you start? Is it the model? The orchestration layer? GPU resource contention? Network throughput? Without correlated telemetry, you're switching between tools, exporting data, and piecing together a story from fragments. What should take minutes ends up taking hours.

Cisco AI-Ready PODs solve the infrastructure side of this equation. These are validated reference architectures that combine Cisco UCS servers, NVIDIA GPUs, high-performance networking, and enterprise storage—all tested and optimized for AI workloads. Whether you're running large language models, computer vision pipelines, or retrieval-augmented generation (RAG) applications, Cisco provides the horsepower you need.

Splunk Observability Cloud solves the visibility problem. Our platform ingests telemetry from every component of your Cisco AI infrastructure—GPUs, compute nodes, networking gear, Kubernetes clusters, and the applications running on top—and correlates it in real time. One platform, one unified view, complete context.

How Splunk Monitors Cisco AI-Ready PODs

Splunk has a comprehensive approach to monitoring AI infrastructure. We're not just looking at one layer—we're monitoring the entire stack and correlating the data so you can see how everything connects.

At the infrastructure layer, Splunk collects metrics from Cisco UCS servers, including CPU, memory, and disk performance. We integrate with Cisco Intersight for hardware health and lifecycle management visibility. For networking, we pull telemetry from Cisco Nexus switches to track data flow, bandwidth utilization, and potential bottlenecks.
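If you want a feel for what that telemetry looks like on the wire, here is a minimal sketch of pushing a custom infrastructure gauge into Splunk Observability Cloud. In practice the Splunk OpenTelemetry Collector and the Cisco integrations handle collection for you; the realm, access token, metric name, and dimension values below are placeholders, not a prescribed setup.

```python
# Minimal sketch: pushing a custom infrastructure gauge to Splunk Observability Cloud.
# In practice the Splunk OpenTelemetry Collector and the Cisco integrations do this
# for you; this only illustrates the shape of a datapoint. The realm, token, metric
# name, and dimension values are placeholders.
import requests

REALM = "us1"                       # your Splunk Observability Cloud realm
ACCESS_TOKEN = "YOUR_INGEST_TOKEN"  # an org ingest token

datapoint = {
    "gauge": [
        {
            "metric": "ucs.node.cpu.utilization",  # hypothetical metric name
            "value": 42.0,
            "dimensions": {
                "host": "ucs-node-01",
                "cluster": "cisco-ai-pod",
            },
        }
    ]
}

resp = requests.post(
    f"https://ingest.{REALM}.signalfx.com/v2/datapoint",
    headers={"X-SF-Token": ACCESS_TOKEN, "Content-Type": "application/json"},
    json=datapoint,
    timeout=10,
)
resp.raise_for_status()
```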

GPU monitoring is critical for AI workloads, and Splunk provides real-time visibility into GPU utilization, memory consumption, power draw, and temperature across all GPUs in your Cisco AI Pod. These aren't just basic metrics—we track which workloads are using which GPUs, how efficiently those GPUs are being utilized, and whether resource imbalances could impact performance.
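The per-GPU signals behind those views are the same ones NVIDIA exposes through NVML. The sketch below reads them directly with the pynvml bindings purely as an illustration of the raw data (utilization, memory, power, temperature) that Splunk's integrations collect and correlate for you.

```python
# Sketch of the per-GPU signals described above, read directly from NVIDIA's NVML
# via the pynvml bindings. Splunk's integrations collect equivalents of these
# automatically; this just shows what the raw data looks like.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)            # .gpu / .memory (%)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)                    # bytes
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0       # milliwatts to watts
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print(
            f"gpu{i}: util={util.gpu}% "
            f"mem={mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB "
            f"power={power_w:.0f}W temp={temp_c}C"
        )
finally:
    pynvml.nvmlShutdown()
```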

Moving up the stack, Splunk provides deep visibility into Kubernetes and Red Hat OpenShift environments. We monitor cluster health, node performance, pod lifecycles, and resource allocation. Our OpenShift-specific dashboards show you exactly what's happening in your container orchestration layer, with metrics tailored to how OpenShift manages workloads.
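Under those dashboards are ordinary Kubernetes facts: how many nodes, how many pods, where they're scheduled, and which ones aren't healthy. This sketch pulls the same facts with the official kubernetes Python client (assuming a reachable kubeconfig), just to show what "cluster health" boils down to.

```python
# Sketch: the kind of cluster facts the Kubernetes/OpenShift dashboards are built on,
# pulled with the official kubernetes Python client. Assumes a reachable kubeconfig.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

nodes = v1.list_node().items
pods = v1.list_pod_for_all_namespaces(watch=False).items

pods_per_node = Counter(p.spec.node_name for p in pods if p.spec.node_name)
not_running = [p for p in pods if p.status.phase not in ("Running", "Succeeded")]

print(f"{len(nodes)} nodes, {len(pods)} pods, {len(not_running)} not running")
for node, count in pods_per_node.most_common():
    print(f"  {node}: {count} pods")
```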

At the application layer, Splunk APM (Application Performance Monitoring) provides distributed tracing for AI workloads. Whether you're running RAG applications, NVIDIA NIM for model inference, or custom AI pipelines, Splunk traces every request through your application stack. We show you exactly where time is being spent—from API calls through embedding generation, vector search, reranking, and LLM inference.
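If you're curious what that trace structure looks like from the application side, here is a sketch using the standard OpenTelemetry Python SDK, with one span per RAG stage. In production the Splunk Distribution of OpenTelemetry for Python auto-instruments most frameworks; the handler and stage functions below are hypothetical stand-ins, and the exporter assumes a local OpenTelemetry Collector listening on its default OTLP port.

```python
# Sketch of the trace structure described above, using the standard OpenTelemetry
# Python SDK. The pipeline stages below are trivial stubs, and the exporter assumes
# a local OpenTelemetry Collector on its default OTLP/gRPC port (localhost:4317).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-server")

# Hypothetical stand-ins for the real pipeline stages.
def embed(question):          return [0.0] * 8
def vector_search(embedding): return ["doc-1", "doc-2"]
def rerank(question, docs):   return docs
def generate(question, docs): return "generated answer"

def handle_query(question: str) -> str:
    with tracer.start_as_current_span("rag.handle_query"):
        with tracer.start_as_current_span("rag.embed"):
            embedding = embed(question)
        with tracer.start_as_current_span("rag.vector_search"):
            docs = vector_search(embedding)
        with tracer.start_as_current_span("rag.rerank"):
            docs = rerank(question, docs)
        with tracer.start_as_current_span("rag.llm_inference") as span:
            span.set_attribute("llm.model", "meta-llama-3.1-8b")  # illustrative attribute
            return generate(question, docs)

if __name__ == "__main__":
    print(handle_query("How do I reset my password?"))
```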

The real power comes from correlation. Splunk links application traces to the underlying infrastructure. When you see a slow LLM response in APM, you can immediately drill down to see GPU utilization on the node where that inference ran. This eliminates the back-and-forth between teams and tools.
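That correlation works because each trace carries the identity of the infrastructure it ran on. The OpenTelemetry Collector's Kubernetes processing adds these attributes automatically; the short sketch below just makes the idea explicit, using OpenTelemetry semantic-convention keys and placeholder values.

```python
# Sketch: the infrastructure identity that makes trace-to-infra correlation possible.
# In practice the OpenTelemetry Collector adds these Kubernetes attributes for you;
# the values below are placeholders. Keys follow OpenTelemetry semantic conventions.
import os
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({
    "service.name": "rag-server",
    "k8s.namespace.name": os.getenv("K8S_NAMESPACE", "ai-apps"),
    "k8s.pod.name": os.getenv("K8S_POD_NAME", "rag-server-abc123"),
    "k8s.node.name": os.getenv("K8S_NODE_NAME", "ocp-worker-03"),
    "host.name": os.getenv("HOSTNAME", "ocp-worker-03"),
})

# Every span exported through this provider now carries node and pod identity,
# which is what lets a slow span pivot straight to that node's GPU metrics.
provider = TracerProvider(resource=resource)
```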

 

A Real-World Example: Troubleshooting AI Performance

Let's walk through a real scenario that demonstrates this unified visibility in action.

A company deployed an AI-powered customer service chatbot on Cisco AI-Ready PODs. The application used retrieval-augmented generation, combining a knowledge base with a large language model to provide accurate, context-aware responses. Within days of launch, users started reporting slow response times. Queries that should take 2-3 seconds were taking 15-20 seconds. Customer satisfaction scores dropped. The pressure was on to fix it fast.

The team started where most teams start: infrastructure. Using Splunk, they checked the Kubernetes cluster running on the Cisco AI Pod. Six OpenShift nodes, 627 pods, all running normally. CPU and memory looked healthy across the board. Network throughput was well within capacity. From an infrastructure perspective, nothing stood out.

Next, they moved to the application layer. This is where Splunk APM became essential. The APM overview immediately flagged the rag-server service as critical, with latency spiking well above acceptable thresholds. The service map showed the request flow: user query → embedding service → vector search → reranking → LLM inference → response.

Most of the latency was in the LLM component. Specifically, the meta-llama-3.1-8b model was taking 13+ seconds per request. This was the bottleneck.

But why? The model was working—no errors, no crashes. Just slow. The team pulled up a trace to see the details. The trace waterfall showed every step of request processing, with timestamps for each operation. The LLM inference span dominated the timeline. Over 13 seconds spent generating the completion.

Here's where the Splunk infrastructure correlation made the difference. From the trace, the team could see exactly which pod was handling the request and which Kubernetes node it was running on. They clicked through to the infrastructure dashboard for the Cisco AI Pod.

The GPU utilization panel told the story immediately. The Cisco AI Pod had four NVIDIA GPUs—enterprise-grade hardware specifically chosen for LLM inference. But only one GPU was maxed out at 100% utilization. The other three were barely being used.

All inference requests were hitting a single GPU. That GPU was bottlenecked, creating a queue of requests waiting to be processed. Meanwhile, three other perfectly capable GPUs sat idle. The CPU on that node was busy too, but that was a symptom, not the cause: it was managing the backlog of queued requests. The root cause was GPU resource imbalance.

The fix was straightforward: reconfigure the LLM deployment to distribute workload across all four GPUs. Response times dropped to sub-second levels. Users were happy. Crisis averted.
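What that reconfiguration looks like depends on the serving stack, which this story doesn't specify (NVIDIA NIM, mentioned earlier, is one option). As one illustration only, a vLLM-style server can shard the model across GPUs with tensor parallelism; running one replica per GPU behind a load balancer is another common pattern.

```python
# One illustrative way to spread inference across all four GPUs, assuming a
# vLLM-style serving stack (the post doesn't specify what the team actually used).
# Another common pattern is one replica per GPU behind a load balancer.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed model ID
    tensor_parallel_size=4,                          # shard across 4 GPUs
)

outputs = llm.generate(
    ["How do I reset my password?"],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```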

Total time from complaint to resolution: under 10 minutes.

Without unified observability, this investigation would have taken far longer. The team would have checked infrastructure, found nothing obvious, escalated to the application team, reviewed logs, maybe restarted services, checked network traces, and eventually—maybe—discovered the GPU imbalance through manual correlation of metrics from multiple tools.

With Splunk monitoring the Cisco AI Pod, they navigated from user complaint to root cause in a single interface. Infrastructure context was right there in the application trace. GPU metrics were correlated with pod performance. Everything connected.

Key Capabilities That Make This Possible

Several Splunk capabilities come together to enable this kind of rapid troubleshooting.

First is unified telemetry collection. Splunk ingests metrics, traces, and logs from every component of your Cisco AI infrastructure. We're not just collecting data—we're normalizing it, enriching it with context, and storing it in a way that makes correlation fast and accurate.

Second is intelligent alerting. Splunk's AI-powered anomaly detection surfaces issues before they become critical. Alerts include full context—what changed, what's correlated, what dependencies might be affected. This means faster triage and more informed decisions.
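If you prefer to define that kind of alert as code, here is a hedged sketch of creating a simple threshold detector through the Splunk Observability Cloud REST API with a SignalFlow program. The metric name, threshold, realm, and token are illustrative; built-in and AutoDetect detectors already cover many of these cases out of the box.

```python
# Sketch: creating a simple threshold detector via the Splunk Observability Cloud
# REST API. The metric name, threshold, realm, and token are illustrative, and
# built-in/AutoDetect detectors cover many of these cases without custom code.
import requests

REALM = "us1"
API_TOKEN = "YOUR_API_TOKEN"

detector = {
    "name": "GPU saturation on Cisco AI POD",
    "programText": (
        "gpu = data('gpu.utilization').mean(by=['host'])\n"
        "detect(when(gpu > 90)).publish('GPU saturated')"
    ),
    "rules": [
        {"detectLabel": "GPU saturated", "severity": "Critical"},
    ],
}

resp = requests.post(
    f"https://api.{REALM}.signalfx.com/v2/detector",
    headers={"X-SF-Token": API_TOKEN, "Content-Type": "application/json"},
    json=detector,
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["id"])
```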

Third is our approach to distributed tracing. Splunk APM doesn't just show you application-level traces. We link those traces to the infrastructure where the code is running. This infrastructure context is what enables you to go from "the LLM is slow" to "GPU 0 is maxed out while GPUs 1-3 are idle" in seconds.

Finally, there's the user experience. Splunk Observability Cloud is designed for speed. Dashboards load fast. Queries return results quickly. Navigation between infrastructure and applications is seamless. When you're troubleshooting a production issue, every second counts.

Why Unified Observability Matters for AI Workloads

AI infrastructure is fundamentally different from traditional workloads, and it demands a different approach to monitoring.


Traditional applications might bottleneck on CPU, memory, or network. AI workloads add GPU utilization, model inference latency, and vector operations to that mix. The interdependencies are more complex—a slow LLM response could stem from GPU resource contention, inefficient model loading, network latency during embedding retrieval, or dozens of other factors.


This is why unified observability is critical. You need to see how all these layers interact. When GPU 0 maxes out, how does that impact pod performance? When a Kubernetes node hits resource limits, which AI services are affected? When your LLM latency spikes, is it because of the model itself or the infrastructure underneath?


Splunk Observability Cloud correlates these signals in real time, giving you the context you need to answer these questions quickly. This correlation is what turns raw metrics into actionable insights—and what turns hours of troubleshooting into minutes.

Test Drive This Now

Want to see this in action? We've built an interactive demo that walks you through the exact troubleshooting scenario described in this post. You'll navigate through Kubernetes metrics, explore APM traces, and discover the GPU utilization imbalance—all in a live Splunk Observability Cloud environment monitoring a Cisco AI-Ready POD.

[Launch the Interactive Demo →]

The demo takes about 5 minutes and gives you hands-on experience with the Splunk interface, so you can see how the unified observability workflow actually feels in practice.

Getting Started

AI infrastructure monitoring doesn't have to be complicated. With Splunk Observability Cloud and Cisco AI-Ready PODs, you get a proven solution that's ready for production AI workloads.

Learn more about Cisco AI-Ready PODs and how they deliver enterprise-grade infrastructure for AI.

Explore Splunk Observability Cloud and see how unified telemetry accelerates troubleshooting.


Don’t miss the next post. Here’s how to subscribe to this blog and get notified when new content goes live. 
