If you’re unfamiliar, .conf is Splunk’s premier event where the Splunk community, customers, partners, and employees come together to explore the latest innovations in security, observability, and data-driven operations. This year’s .conf was no different. Filled with keynotes, visionary insights, product announcements, innovations, deep-dive technical sessions, and hands-on labs and workshops, .conf25 was informative, impactful, and energizing.
The standing-room-only technical session “Observability for Gen AI: Monitoring LLM Applications with OpenTelemetry and Splunk,” developed by Splunkers Derek Mitchell (Staff Observability Strategist) and Sarah Ware (Senior Observability Strategist), was one of those energizing moments at .conf. In the session, Derek walked through how to build a Retrieval Augmented Generation (RAG) application and then treat it like any other production service: observable, measurable, and tunable.
In this post, I’ll walk through the scenario Derek used, how the example app was put together, and how Splunk Observability Cloud provides the visibility into LLM performance and cost needed to successfully build, deploy, manage, and run GenAI applications and services.
Derek started the session with a very relatable, real-world example: a grade-school science question, “What are the four layers of soil?” The trick is that we don’t just want any reasonable answer; we want the textbook-correct answer that lines up with what a teacher expects on homework or a test.
You can absolutely throw that question into a public LLM like ChatGPT and see what comes back. You’ll likely get something that looks like a decent answer, but it won’t match the specific wording or structure that our science textbook uses. Derek showed what a typical ChatGPT response might look like:
Accurate, but not the exact answer we’re looking for.
Instead, to get the specific answer we want, Derek framed a second approach: build a GenAI application with content sourced from the actual textbook. The idea is to ingest the textbook in PDF format, break it into pages, index those pages as embeddings in a vector database, and then use a RAG pattern so the LLM bases its answer only on context from that textbook.
At a high level, the application flow looks like this:
Derek contrasted this with the obviously inefficient option of sending the entire PDF to the LLM each time, which wastes tokens, risks hitting context limits, and is more expensive than it needs to be.
Instead, the application can isolate the key pieces of data and send only the 5 most relevant pages to the LLM along with our question: “What are the four layers of soil?”
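The “most relevant” pages are found by comparing embedding vectors, typically with a similarity measure such as cosine similarity. As a minimal, framework-free sketch of the idea (the tiny 3-dimensional “embeddings” below are made up purely for illustration; real embedding models produce vectors with hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_pages(question_vec, page_vecs, k=5):
    """Return the indices of the k pages most similar to the question."""
    scores = [(cosine_similarity(question_vec, v), i) for i, v in enumerate(page_vecs)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

# Toy vectors standing in for page embeddings.
pages = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]
question = [1.0, 0.0, 0.0]
print(top_k_pages(question, pages, k=2))  # indices of the two closest pages
```

A vector database like Chroma performs this same nearest-neighbor lookup at scale, so the application never has to do it by hand.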
Under the hood, the session focused on a small but complete RAG application written in Python. The key concepts were, again:
Derek emphasized that each page of the textbook is treated as its own document. For each page, the app runs the OpenAI embeddings model, stores the resulting vector in Chroma, and associates that vector with the original text.
Once that’s in place, the similarity search at query time becomes just another vector DB operation. When a user asks a question, the application follows the same embedding process for the question, queries Chroma for similar pages, and then passes those pages plus the question to the LLM.
Instead of hand-building HTTP requests, prompts, and store integrations, the app uses LangChain’s primitives to define chains, connect to Chroma, and integrate the LLM. The net result is that the core logic fits into roughly 25–30 lines of Python code.
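A sketch of what that core might look like, assuming the `langchain-community`, `langchain-openai`, and `langchain-chroma` packages, an `OPENAI_API_KEY` in the environment, and a hypothetical `textbook.pdf` — this is an illustration of the pattern, not the session’s exact code:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Load the textbook: PyPDFLoader yields one Document per page.
pages = PyPDFLoader("textbook.pdf").load()

# Embed each page and store the vectors, tied to the original text, in Chroma.
vector_store = Chroma.from_documents(pages, OpenAIEmbeddings())

# At query time, the question is embedded the same way and similar pages retrieved.
similar_pages = vector_store.similarity_search("What are the four layers of soil?", k=5)
```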
On top of LangChain, the session introduced LangGraph, a framework for agent orchestration. In the session demo, the agent graph was intentionally simple but illustrates the idea clearly:
The application uses two key functions: one takes the user’s question and performs a similarity search against the Chroma vector store; the other sends the similar documents, along with the user’s question, to the OpenAI LLM and returns the response.
Bringing it all together, the application creates a graph with these two functions, invokes it with a question, and returns an answer that matches our textbook:
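The two functions and the graph wiring might look roughly like this with LangGraph; the function names, prompt wording, and model choice here are assumptions based on the session’s description, not its exact code:

```python
from typing_extensions import TypedDict
from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langgraph.graph import StateGraph, START

class State(TypedDict):
    question: str
    context: list
    answer: str

vector_store = Chroma(embedding_function=OpenAIEmbeddings())
llm = ChatOpenAI(model="gpt-4o-mini")

def retrieve(state: State):
    # Similarity search in the Chroma vector store for pages like the question.
    return {"context": vector_store.similarity_search(state["question"])}

def generate(state: State):
    # Send the retrieved pages plus the question to the LLM.
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    prompt = f"Answer using only this context:\n{context_text}\n\nQuestion: {state['question']}"
    return {"answer": llm.invoke(prompt).content}

# Wire the two functions into a simple retrieve -> generate graph and run it.
builder = StateGraph(State).add_sequence([retrieve, generate])
builder.add_edge(START, "retrieve")
graph = builder.compile()
result = graph.invoke({"question": "What are the four layers of soil?"})
print(result["answer"])
```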
Once the app works locally, the natural next step is to deploy it and understand how it behaves under real usage. This is where observability comes in.
Derek broke the process of adding observability into two concrete steps:
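Assuming the Splunk Distribution of OpenTelemetry Python (the `splunk-opentelemetry` package), zero-code instrumentation can look roughly like this; the service name, realm, and token below are placeholders:

```shell
pip install splunk-opentelemetry
splunk-py-trace-bootstrap                    # install instrumentation for detected libraries

export OTEL_SERVICE_NAME=back-to-school-with-gen-ai
export SPLUNK_REALM=us0                      # your Splunk Observability Cloud realm
export SPLUNK_ACCESS_TOKEN=<your-token>      # your ingest access token

splunk-py-trace python app.py                # run the app with auto-instrumentation
```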
After that, data starts flowing into Splunk Observability Cloud:
Within the Splunk Observability Cloud Service Map, the main service appears as back-to-school-with-gen-ai. It talks to two different models: the embeddings model and gpt-4o-mini. On the Service Map, you also see the Chroma vector database, surfaced via its SQLite backend.
As soon as a user asks the soil-layers question, you can follow their request through a distributed trace:
That trace shows all the key AI-related interactions, and Splunk Observability Cloud decorates them with icons so they stand out:
When you click into a vectordb span, you see the actual SQL query used to perform the similarity search:
When you click into a chat span, you see metadata that is especially important for GenAI workloads, like how many tokens were used, what the cost was, and other parameters around that LLM call:
One detail Derek highlighted is that the SQL query shown in the Chroma span wasn’t written by him at all; it was generated by LangChain. Splunk Observability Cloud still exposes it, which means you can see and investigate queries and behavior coming from your frameworks, not just your own code.
The traces also show the original question and the chunks of textbook content retrieved as context:
And the final answer produced by the LLM using that context:
Now we not only have our question and expected response, but we also have a reliable way to gain valuable insight into our GenAI app so we can easily debug and explain what’s going on within it.
With visibility into things like token counts and latencies per request, optimization becomes a data-driven exercise instead of guesswork. In the session, Derek walked through the initial request’s token count, which, as we saw previously in Splunk Observability Cloud, was about 925 tokens. To improve both cost and performance, he revisited the retrieve function in the code.
By default, LangChain’s similarity search returns four related documents. For a simple concept like soil layers, four full pages of context may be more than necessary. Derek experimented with the k parameter that controls how many documents to retrieve, dropping k from the default 4 down to 2:
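In code, that tuning is a one-line change to the similarity search call (the surrounding function and variable names are assumptions based on the session’s description):

```python
def retrieve(state: State):
    # Retrieve only the 2 most similar pages instead of the default 4.
    return {"context": vector_store.similarity_search(state["question"], k=2)}
```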
With that single change, the app:
Those numbers might look small at first glance, but multiplied across a high volume of requests, the savings add up. Derek also emphasized that there is always a trade-off. For simple subjects like this one, two documents of context might be plenty. For more complex topics, you may need to raise k again to preserve answer quality.
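A back-of-the-envelope calculation makes the scale effect concrete. The after-tuning token count and the per-million-token price below are assumed placeholders for illustration, not measured values or official rates:

```python
# Illustrative numbers: ~925 input tokens per request before tuning (from the
# session's trace), and an assumed ~500 after dropping k from 4 to 2.
TOKENS_BEFORE = 925
TOKENS_AFTER = 500
PRICE_PER_MILLION_TOKENS = 0.15  # assumed placeholder rate in USD

def monthly_cost(tokens_per_request, requests_per_month):
    """Estimated monthly input-token cost for a given request volume."""
    return tokens_per_request * requests_per_month * PRICE_PER_MILLION_TOKENS / 1_000_000

requests = 10_000_000  # a high-volume workload
saved = monthly_cost(TOKENS_BEFORE, requests) - monthly_cost(TOKENS_AFTER, requests)
print(f"Estimated monthly savings: ${saved:,.2f}")
```

Fractions of a cent per request become meaningful once request volume is in the millions, which is exactly the trade-off Derek described.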
The key point is that Splunk Observability Cloud gives you the visibility to make those decisions intentionally. You can experiment, look at traces and metrics, and then decide whether a change is worth it.
In the Q&A, Derek and attendees stepped away from textbooks and talked about where this pattern fits in typical enterprise environments. Support workflows are an obvious candidate. Most organizations have years of support cases logged and stored. Those cases can be embedded and then stored in a vector database. When a new ticket comes in, an agent (or an assistant) can search for similar cases and quickly see how similar issues were resolved in the past.
The pattern is not limited to PDF formats. There are many different loaders available in common frameworks, including ones that pull content from corporate intranet sites that are not accessible on the public internet. It’s possible to use LangChain to scrape that internal content, embed it, and load it into a vector store so you can use it as grounding context for an assistant.
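For example, LangChain’s community loaders include one for web pages; the URL below is a placeholder for an internal site, and this is a sketch rather than a complete pipeline:

```python
from langchain_community.document_loaders import WebBaseLoader

# Scrape an internal page and load it as documents ready for embedding.
docs = WebBaseLoader("https://intranet.example.com/runbooks").load()
```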
The same idea also applies to meeting recordings. With speech-to-text, you can convert recordings of team meetings into text, embed that text, and make it queryable through a RAG pipeline. That gives teams a way to ask questions about past decisions or discussions without manually searching through notes.
The message from the session was that once you have a reliable way to load data into a vector store and a clear observability story around the GenAI app, it becomes much easier to apply the same pattern to new domains.
On the observability side, questions came up about which version of the OpenTelemetry Collector to use. The guidance was straightforward: any modern Collector works, whether you run upstream OpenTelemetry or the Splunk Distribution. In practice, teams often pair that with additional instrumentation libraries and LangChain-specific instrumentation, which the Splunk team is working on bringing into the OpenTelemetry project.
If you want to try this pattern yourself, a good starting point is to pick a narrow but meaningful use case — something like a single PDF runbook or a slice of internal documentation — and build a small RAG prototype on top of it.
From there, you can:
If you’re not already using Splunk Observability Cloud, you can start with a free 14-day trial or work with your Splunk representative to get access and replicate a “back-to-school” style demo with your own content.