
Observability for AI Applications: Troubleshooting Latency

CaitlinHalla
Splunk Employee

If you’re working with proprietary company data, you’re probably hosting one or more LLMs locally. But how do you understand the performance of those models and whether they’re impacted by other services? In this post, we’ll look at how Splunk Observability Cloud can help you gain this insight into LLM-based applications to troubleshoot and find root causes just as quickly as we can with our other applications and services.

Troubleshooting Latency in an LLM-based Application

We’ll start by logging in to Splunk Observability Cloud and navigating to Splunk Application Performance Monitoring (APM). We’ll make sure we’re in the correct environment for our application and set the time range to the last hour:

[Screenshot: Splunk APM overview for the selected environment and time range, showing a critical alert on the auto-prompter service]

At a high level, our application uses multiple LLMs to answer questions from end users. One of these LLMs is gpt-4o-mini from OpenAI, and the other is an open-source LLM from Mistral AI, which we’re self-hosting on Nvidia GPU hardware. The application was built using the LangChain framework and uses a Chroma vector database for Retrieval Augmented Generation (RAG). It’s instrumented with OpenTelemetry to automatically capture our telemetry data so we have full visibility into its performance.
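To make that architecture concrete, here’s a minimal sketch of the kind of LangChain RAG setup described above. The model names, endpoint URL, collection name, and helper function are illustrative assumptions, not the application’s actual code:

```python
# Minimal RAG sketch (assumes current langchain-openai and langchain-chroma
# packages, plus OPENAI_API_KEY in the environment for the OpenAI pieces).
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma

# Chroma vector store used for Retrieval Augmented Generation
vectorstore = Chroma(collection_name="docs", embedding_function=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Two LLMs: OpenAI's gpt-4o-mini and a self-hosted Mistral model exposed
# through an OpenAI-compatible endpoint (hypothetical internal URL).
openai_llm = ChatOpenAI(model="gpt-4o-mini")
mistral_llm = ChatOpenAI(
    model="mistral",
    base_url="http://mistral.internal:8000/v1",
    api_key="not-needed",  # many local OpenAI-compatible servers ignore the key
)

def answer(question: str) -> str:
    # Retrieve similar documents and build a grounded prompt
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return mistral_llm.invoke(prompt).content
```

With OpenTelemetry instrumentation in place, the retrieval and LLM calls in a flow like this surface as spans in Splunk APM, which is what we’ll lean on below.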

Looking at the screenshot above, it looks like we have a critical alert firing in one of our main services, the auto-prompter service. Let’s take a closer look by selecting that “1 Critical” label next to the service name to open up the alert details:  

[Screenshot: Alert details for the critical latency alert on the auto-prompter service]

In the alert, we can see that the service is experiencing higher latency than normal. Let’s click on the Troubleshoot hyperlink in the Explore Further section on the lower right side of the screen to see what’s going on for this service in APM: 

[Screenshot: APM troubleshooting view for the auto-prompter service]

Over in APM, we can see that our auto-prompter service has started experiencing some significant latency spikes. Let’s select the Traces tab and open up a long-running trace to see where the latency is originating:

[Screenshot: Trace waterfall showing spans for each step of the request]

In this trace, we can see that spans have been captured for each major step used by our application to respond to the question.   

We first see a GET request to our auto-prompter service to build a prompt, which kicks off the process of sending the request to a couple of different LLMs to get a response. There’s a call to our Chroma vector database to look for similar documents, and finally, a call to a data-scrubber service, which uses an LLM to remove any personally identifiable information (PII) from the response.
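If any of these steps didn’t show up as its own row in the waterfall, we could wrap it in a custom span with the OpenTelemetry API. Here’s a brief sketch for the vector-database lookup; the span name, attribute, and find_similar_documents helper are hypothetical, not the application’s actual code:

```python
from opentelemetry import trace

tracer = trace.get_tracer("auto-prompter")

def find_similar_documents(vectorstore, query: str):
    # Wrap the Chroma lookup in its own span so it appears as a distinct
    # step in the trace waterfall (hypothetical helper, not the app's code).
    with tracer.start_as_current_span("chroma.similarity_search") as span:
        docs = vectorstore.similarity_search(query, k=4)
        span.set_attribute("retrieval.documents.count", len(docs))
        return docs
```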

From this trace waterfall view, we can see that the bulk of the execution time for this trace is spent calling the LLM.  We can select the specific span where the auto-prompter service invokes the LLM to get a closer look: 

[Screenshot: Span details for the auto-prompter service’s LLM call]

In the Span Details, we can see from the duration that it took the LLM over 21 seconds to respond to the auto-prompter service.

Next, let’s move over to Splunk Infrastructure Monitoring by selecting the three dots next to the auto-prompter service name in this pane and clicking the top link to go to the OpenAI Instance Navigator.

This gives us a quick look at the underlying infrastructure:

[Screenshot: OpenAI Instance Navigator dashboard in Splunk Infrastructure Monitoring]

With this dashboard, we can see details related to the OpenAI-compatible models utilized by our auto-prompter service, including the total number of input and output tokens used, as well as token usage broken down by service and model.

Nothing unusual appears to be happening here, and the token usage by the auto-prompter service is low. 

Let’s explore a bit more by switching from the overview to the table view, so we can see the other services using the OpenAI framework:

[Screenshot: OpenAI Instance Navigator table view listing services and the models they use]

From this view, we can see all the services utilizing OpenAI-compatible LLMs, along with the model they’re using.  

It looks like another service named otel-genai-zero-code is using the same Mistral LLM, and with 1.5 million output tokens in the past hour, this service is probably consuming the bulk of the available capacity of this LLM’s underlying GPU infrastructure. 

We’ll validate this by navigating to Infrastructure, selecting AI Frameworks, and then opening the Nvidia GPU navigator:

[Screenshot: Nvidia GPU navigator showing GPU utilization nearing 100%]

From the Nvidia GPU navigator, we can clearly see that the GPU resources used by this LLM are nearing 100%.  

We can resolve this “noisy neighbor” issue by working with the team that manages this service to reduce its usage of the LLM, by allocating more GPU resources, or by moving our application to its own dedicated GPU hardware. It would also be a good idea to create a detector and alert that proactively notifies us if another application starts consuming too many tokens, before it puts too much strain on our GPU.
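As a starting point, here’s a rough sketch of creating such a detector programmatically against the Splunk Observability Cloud (SignalFx) v2 detector API. The metric name, threshold, realm, and SignalFlow program below are assumptions to adapt to whatever token-usage metric your instrumentation actually emits:

```python
# Sketch: create a token-usage detector via the v2 detector API.
# Assumptions: the 'gen_ai.client.token.usage' metric name, the 1M-token
# threshold, and the realm are placeholders to adjust for your environment.
import os
import requests

REALM = os.environ.get("SPLUNK_REALM", "us1")
TOKEN = os.environ["SPLUNK_ACCESS_TOKEN"]  # org access token with API permissions

program_text = (
    "tokens = data('gen_ai.client.token.usage')"
    ".sum(by=['service.name']).publish(label='tokens')\n"
    "detect(when(tokens > 1000000, lasting='1h'))"
    ".publish('High LLM token usage')"
)

detector = {
    "name": "High LLM token usage by service",
    "programText": program_text,
    "rules": [
        {
            "detectLabel": "High LLM token usage",
            "severity": "Critical",
            "notifications": [],  # add notification targets as needed
        }
    ],
}

resp = requests.post(
    f"https://api.{REALM}.signalfx.com/v2/detector",
    headers={"X-SF-TOKEN": TOKEN, "Content-Type": "application/json"},
    json=detector,
)
resp.raise_for_status()
print("Created detector:", resp.json().get("id"))
```

The same detector can of course be built directly in the Splunk Observability Cloud UI; the API route is handy if you manage detectors as code.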

Wrap Up 

Not only does Splunk Observability Cloud provide full visibility into our standard applications and services, it also offers full insight into applications and services that use LLMs and Nvidia GPU hardware. We were able to use Splunk Observability Cloud to quickly find the root cause of the performance issue within our LLM-based application and resolve it, getting our application back to an optimally performing state.

Want to implement full observability in your LLM-based applications and services? Try Splunk Observability Cloud free for 14 days!
