If you’re working with proprietary company data, you’re probably going to have a locally hosted LLM or many locally hosted LLMs. But how do you understand the performance of those models and whether they’re impacted by other services? In this post, we’ll look at how Splunk Observability Cloud can help you gain this insight into LLM-based applications to troubleshoot and find root causes just as quickly as we can with our other applications and services.
Troubleshooting Latency in an LLM-based Application
We’ll start by logging in to Splunk Observability Cloud and navigating to Splunk Application Performance Monitoring (APM). We’ll make sure we’re in the correct environment for our application and set the time range to the last hour:
At a high level, our application uses multiple LLMs to answer questions from end users. One of these LLMs is gpt-4o-mini from OpenAI, and the other is an open-source LLM from Mistral AI, which we’re self-hosting on Nvidia GPU hardware. The application was built using the LangChain framework and utilizes a Chroma vector database for Retrieval Augmented Generation (RAG). It’s instrumented with OpenTelemetry to automatically capture telemetry data, so we have full visibility into its performance.
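For reference, here’s a minimal sketch of the kind of LangChain and Chroma RAG setup described above. It’s illustrative only: the collection name, the self-hosted Mistral endpoint URL, and the prompt wording are assumptions rather than details from our actual application.

```python
# Minimal RAG sketch: two OpenAI-compatible LLMs plus a Chroma vector store.
# Assumes the langchain-openai and langchain-chroma packages are installed.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma

# gpt-4o-mini via the OpenAI API
openai_llm = ChatOpenAI(model="gpt-4o-mini")

# Self-hosted Mistral model exposed through an OpenAI-compatible endpoint
# (the URL and model name here are hypothetical)
mistral_llm = ChatOpenAI(
    model="mistral",
    base_url="http://mistral.internal:8000/v1",
    api_key="not-needed",
)

# Chroma vector store used for Retrieval Augmented Generation
vectorstore = Chroma(collection_name="docs", embedding_function=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

def answer(question: str) -> str:
    # Look up similar documents and pass them to the LLM as context
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    response = openai_llm.invoke(
        f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    )
    return response.content
```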
In the screenshot above, we can see that a critical alert is firing on one of our main services, the auto-prompter service. Let’s take a closer look by selecting the “1 Critical” label next to the service name to open the alert details:
In the alert, we can see that the service is experiencing higher latency than normal. Let’s click on the Troubleshoot hyperlink in the Explore Further section on the lower right side of the screen to see what’s going on for this service in APM:
Over in APM, we can see that our auto-prompter service has started experiencing some significant latency spikes. Let’s select the Traces tab and open up a long running trace to see where the latency is originating:
In this trace, we can see that spans have been captured for each major step used by our application to respond to the question.
We first see a GET request to our auto-prompter service to build a prompt, which kicks off the process of sending the request to a couple of different LLMs to get a response. There’s a call to our Chroma vector database to look for similar documents, and finally a call to the data-scrubber service, which uses an LLM to remove any personally identifiable information (PII) from the response.
From this trace waterfall view, we can see that the bulk of the execution time for this trace is spent calling the LLM. We can select the specific span where the auto-prompter service invokes the LLM to get a closer look:
In the Span Details, we can see from the duration that it took the LLM over 21 seconds to respond to the auto-prompter service.
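Our spans come from automatic instrumentation, but conceptually each LLM call is wrapped in a span roughly like the sketch below, which uses the OpenTelemetry Python API. The span name, attribute, and call_llm helper are illustrative, not what our instrumentation emits verbatim.

```python
from opentelemetry import trace

tracer = trace.get_tracer("auto-prompter")

def call_llm(prompt: str) -> str:
    """Stand-in for the real LLM client call (hypothetical)."""
    return "..."

def build_prompt(question: str) -> str:
    # The span's duration is what shows up as the 21-second LLM call
    # in the trace waterfall above
    with tracer.start_as_current_span("chat gpt-4o-mini") as span:
        span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
        return call_llm(question)
```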
Next, let’s move over to Splunk Infrastructure Monitoring by selecting the three dots next to the auto-prompter service name in this pane and clicking the top link to go to the OpenAI Instance Navigator.
This gives us a quick look at the underlying infrastructure:
With this dashboard, we can see details related to the OpenAI-compatible models utilized by our auto-prompter service. This includes the total number of input and output tokens used, as well as the tokens used broken down by service and model.
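These token counts come from metrics emitted by the instrumentation. As a rough sketch, recording them by hand with the OpenTelemetry metrics API might look like the following; the metric and attribute names follow the OpenTelemetry GenAI semantic conventions and may differ from what your instrumentation actually emits.

```python
from opentelemetry import metrics

meter = metrics.get_meter("auto-prompter")

# Histogram for token usage, named per the OTel GenAI semantic conventions
token_usage = meter.create_histogram(
    "gen_ai.client.token.usage",
    unit="{token}",
    description="Input and output tokens used per LLM call",
)

# Record the output tokens for a single call (values and attributes are illustrative)
token_usage.record(
    412,
    attributes={
        "gen_ai.token.type": "output",
        "gen_ai.request.model": "gpt-4o-mini",
    },
)
```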
Nothing unusual appears to be happening here, and the token usage by the auto-prompter service is low.
Let’s explore a bit more by switching from the overview to the table view so we can see the other services using OpenAI-compatible LLMs:
From this view, we can see all the services utilizing OpenAI-compatible LLMs, along with the model they’re using.
It looks like another service named otel-genai-zero-code is using the same Mistral LLM, and with 1.5 million output tokens in the past hour, this service is probably consuming the bulk of the available capacity of this LLM’s underlying GPU infrastructure.
We’ll validate this by navigating to Infrastructure, selecting AI Frameworks, and then opening up the Nvidia GPU navigator:
From the Nvidia GPU navigator, we can clearly see that the GPU resources used by this LLM are nearing 100%.
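If you also want to spot-check utilization directly on the GPU host, a quick sketch using the pynvml bindings for the NVIDIA Management Library (an extra sanity check, not part of this walkthrough) could look like this:

```python
# Spot-check GPU utilization on the LLM host via NVML.
# Assumes NVIDIA drivers and the nvidia-ml-py (pynvml) package are installed.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i} ({name}): {util.gpu}% compute, "
              f"{mem.used / mem.total:.0%} memory used")
finally:
    pynvml.nvmlShutdown()
```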
We can resolve this “noisy neighbor” issue by working with the team that manages this service to reduce its usage of the LLM, by allocating more GPU resources, or by moving our application onto its own dedicated GPU hardware. It would also be a good idea to create a detector and alert that proactively notifies us if another application starts using too many tokens, before it puts too much strain on our GPU.
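As a rough sketch, such a detector could be created through the Splunk Observability Cloud REST API. The realm, token, SignalFlow metric name (gen_ai.client.token.usage), dimension name, and threshold below are all assumptions; adjust them to match what your instrumentation actually reports.

```python
import requests

REALM = "us1"                      # your Splunk Observability realm (assumption)
SFX_TOKEN = "<org-access-token>"   # org access token with API permissions

# SignalFlow program: alert when any service uses more than 1M tokens per hour.
# Metric and dimension names are assumptions based on the OTel GenAI conventions.
program = """
tokens = data('gen_ai.client.token.usage').sum(by=['service.name']).sum(over='1h').publish(label='tokens')
detect(when(tokens > 1000000)).publish('High LLM token usage')
""".strip()

detector = {
    "name": "High LLM token usage per service",
    "programText": program,
    "rules": [
        {
            "detectLabel": "High LLM token usage",
            "severity": "Warning",
            "notifications": [],
        }
    ],
}

resp = requests.post(
    f"https://api.{REALM}.signalfx.com/v2/detector",
    headers={"X-SF-TOKEN": SFX_TOKEN, "Content-Type": "application/json"},
    json=detector,
)
resp.raise_for_status()
print("Created detector:", resp.json().get("id"))
```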
Wrap Up
Not only does Splunk Observability Cloud provide full visibility into our standard applications and services, it also offers full insight into applications and services that utilize LLMs and Nvidia GPU hardware. We were able to use Splunk Observability Cloud to quickly find the root cause of the performance issue within our LLM-based application and resolve it, getting our application back to an optimally performing state.
Want to implement full observability in your LLM-based applications and services? Try Splunk Observability Cloud free for 14 days!