
.conf25 technical session recap of Observability for Gen AI: Monitoring LLM Applications with OpenTelemetry and Splunk

CaitlinHalla
Splunk Employee

If you’re unfamiliar, .conf is Splunk’s premier event where the Splunk community, customers, partners, and employees come together to explore the latest innovations in security, observability, and data-driven operations. This year’s .conf was no different. Filled with keynotes, visionary insights, product announcements, innovations, deep-dive technical sessions, and hands-on labs and workshops, .conf25 was informative, impactful, and energizing.

The standing-room-only technical session “Observability for Gen AI: Monitoring LLM Applications with OpenTelemetry and Splunk,” developed by Splunkers Derek Mitchell (Staff Observability Strategist) and Sarah Ware (Senior Observability Strategist), was one of these energizing moments at .conf. In the session, Derek walked through how to build a Retrieval Augmented Generation (RAG) application and then treat it like any other production service: observable, measurable, and tunable.

In this post, I’ll walk through the scenario Derek used, how the example app was put together, and how Splunk Observability Cloud provides the visibility needed for LLM performance and cost monitoring in order to successfully build, deploy, manage, and run GenAI applications and services.

From textbook question to GenAI design

Derek started the session with a very relatable, real-world example: a grade-school science question, “What are the four layers of soil?” The trick is that we don’t just want any reasonable answer; we want the textbook-correct answer that lines up with what a teacher expects on homework or a test.

You can absolutely throw that question into a public LLM like ChatGPT and see what comes back. You’ll likely get something that looks like a decent answer, but it won’t match the specific wording or structure that our specific science textbook uses. Derek showed an example of what a ChatGPT response might look like:

[Screenshot: example ChatGPT response to the soil-layers question]

Accurate, but not the exact answer we’re looking for.

Instead, to get the specific answer we want, Derek framed a second approach: build a GenAI application with content sourced from the actual textbook. The idea is to ingest the textbook in PDF format, break it into pages, index those pages as embeddings in a vector database, and then use a RAG pattern so the LLM bases its answer only on context from that textbook.

At a high level, the application flow looks like this:

  1. Take the textbook PDF and treat each page as a separate document.
  2. Use the OpenAI embeddings model to compute a vector for each page.
  3. Store those vectors in a vector database (here, Derek uses Chroma).
  4. When a user asks, “What are the four layers of soil?”, compute an embedding for the question.
  5. Use that question embedding to run a similarity search in the vector database and pull back the most relevant pages.
  6. Send just those pages plus the question to the LLM and let it answer using that context.
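This flow can be sketched end to end with stand-in components. In the toy version below, a bag-of-characters function replaces the OpenAI embeddings model and a plain in-memory list replaces Chroma; the shape of the pipeline, not the quality of the retrieval, is the point:

```python
import math

# Stand-in for the OpenAI embeddings model: a toy bag-of-characters
# vector. Real embeddings are dense vectors that capture semantics.
def embed(text: str) -> list[float]:
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

# Cosine similarity, the usual ranking metric for a similarity search.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Steps 1-3: treat each page as a document and index its embedding.
pages = [
    "The four layers of soil are topsoil, subsoil, parent material, and bedrock.",
    "The water cycle includes evaporation, condensation, and precipitation.",
    "Rocks are classified as igneous, sedimentary, or metamorphic.",
]
index = [(embed(page), page) for page in pages]

# Steps 4-5: embed the question and pull back the most relevant page.
question = "What are the four layers of soil?"
q_vec = embed(question)
ranked = sorted(index, key=lambda item: cosine(q_vec, item[0]), reverse=True)
context = [page for _, page in ranked[:1]]

# Step 6: send just that context plus the question to the LLM.
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
```

In the real app, `embed` is an OpenAI embeddings call and the sorted list is a Chroma query, but the retrieve-then-prompt structure is the same.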

Derek contrasted this with the obviously inefficient option of sending the entire PDF to the LLM each time, which wastes tokens, risks hitting context limits, and is more expensive than it needs to be.

Instead, the key pieces of data can be isolated so that only the 5 most relevant pages are sent to the LLM along with our question: “What are the four layers of soil?”

[Diagram: only the most relevant pages are sent to the LLM with the question]

Building the RAG app: embeddings, vector DB, and LangChain

Under the hood, the session focused on a small but complete RAG application written in Python. The key concepts were, again:

  • Embeddings as numerical representations of the semantic content of each relevant textbook page.
  • A vector database to store and search those embeddings efficiently.
  • A framework (LangChain) to stitch together the prompt, retrieval, and LLM call without a lot of boilerplate.

[Diagram: embeddings, vector database, and the LangChain framework]

Derek emphasized that each page of the textbook is treated as its own document. For each page, the app runs the OpenAI embeddings model, stores the resulting vector in Chroma, and associates that vector with the original text. 

Once that’s in place, the similarity search at query time becomes just another vector DB operation. When a user asks a question, the application follows the same embedding process for the question, queries Chroma for similar pages, and then passes those pages plus the question to the LLM. 

[Diagram: similarity search at query time]

Instead of hand-building HTTP requests, prompts, and store integrations, the app uses LangChain’s primitives to define chains, connect to Chroma, and integrate the LLM. The net result is that the core logic fits into roughly 25–30 lines of Python code.

Orchestrating the workflow with LangGraph

On top of LangChain, the session introduced LangGraph, a framework for agent orchestration. In the session demo, the agent graph was intentionally simple but illustrated the idea clearly:

  • The textbook, in PDF format, is loaded into memory:

[Code screenshot: loading the textbook PDF into memory]

  • The OpenAI embeddings model calculates an embedding for each page in the document, which is then loaded into a vector database (in this case, Chroma):

[Code screenshot: computing per-page embeddings and loading them into Chroma]

  • The database and LLM are initialized; the app specifies how to calculate embeddings for the questions that are asked, where to find the vector database, and which LLM to use (Derek used gpt-4o-mini):

[Code screenshot: initializing the embeddings model, vector store, and LLM]

  • A chat prompt template is defined that instructs the model to use only the retrieved pieces of context to answer the question, and to respond with “I don’t know” if the answer cannot be found:

[Code screenshot: the chat prompt template]
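Paraphrasing the template the session described (this wording is an approximation, not Derek’s exact prompt), it can be expressed as a plain format string:

```python
# Approximation of the session's template: answer only from the
# retrieved context, and admit it when the answer isn't there.
TEMPLATE = (
    "Use only the following pieces of retrieved context to answer "
    "the question. If the answer cannot be found in the context, "
    "say \"I don't know\".\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

# Fill the slots with retrieved context and the user's question.
prompt = TEMPLATE.format(
    context="The four layers of soil are topsoil, subsoil, parent material, and bedrock.",
    question="What are the four layers of soil?",
)
```

In the real app this role is played by a LangChain chat prompt template, but the grounding instruction works the same way: the context slot constrains what the LLM is allowed to draw on.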

Two key functions are used in the application: one takes the user’s question and runs a similarity search in the Chroma vector store; the other sends the similar documents, along with the question, to the OpenAI LLM and returns the response.

Bringing it all together, the application creates a graph with these two functions, invokes it with a question, and returns an answer equivalent to our textbook answer:

[Code screenshot: building the graph and invoking it with the question]
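The retrieve-then-generate shape can be made concrete with a plain-Python stand-in. This is not the LangGraph API, just the control flow it encapsulates, with canned retrieval and generation steps:

```python
# Minimal stand-in for a two-node agent graph: state flows through
# retrieve, then generate. LangGraph's graph objects play this role
# in the real app; this is plain Python for illustration only.
def retrieve(state: dict) -> dict:
    # Real app: similarity search in the Chroma vector store.
    state["context"] = [
        "The four layers of soil are topsoil, subsoil, parent material, and bedrock."
    ]
    return state

def generate(state: dict) -> dict:
    # Real app: send context + question to the OpenAI LLM.
    state["answer"] = "From the textbook: " + state["context"][0]
    return state

def invoke(graph: list, state: dict) -> dict:
    # Run each node in order, threading the state through.
    for node in graph:
        state = node(state)
    return state

result = invoke(
    [retrieve, generate],
    {"question": "What are the four layers of soil?"},
)
```

The value of a real orchestration framework shows up as the graph grows: branching, retries, and tool calls all hang off the same state-passing pattern.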

OpenTelemetry and Splunk Observability Cloud

Once the app works locally, the natural next step is to deploy it and understand how it behaves under real usage. This is where observability comes in.

Derek broke adding observability into two concrete steps:

  1. Install the OpenTelemetry Collector on the environment where the app runs. In the session, this was as simple as running a curl command to fetch and install the Collector:

[Screenshot: installing the Collector with a curl command]

  2. Instrument with the Splunk Distribution of OpenTelemetry Python: install the package, run a bootstrap command to add additional instrumentation based on the packages the app uses, set a few environment variables to tell OpenTelemetry how to report the data, and finally, start the application:

[Screenshot: instrumenting and starting the application]
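The exact commands depend on the distribution version, but the sequence typically looks like the sketch below; the service name matches the demo, the endpoint assumes a Collector listening locally, and `app.py` stands in for the demo app:

```shell
# Install the Splunk Distribution of OpenTelemetry Python
pip install splunk-opentelemetry

# Detect installed packages and add matching instrumentation
opentelemetry-bootstrap -a install

# Tell OpenTelemetry how to identify and where to report the data
export OTEL_SERVICE_NAME=back-to-school-with-gen-ai
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Start the application with auto-instrumentation enabled
opentelemetry-instrument python app.py
```

No code changes are required for this baseline: the auto-instrumentation wraps the libraries the bootstrap step detected.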

After that, data starts flowing into Splunk Observability Cloud:

[Screenshot: data flowing into Splunk Observability Cloud]

Within the Splunk Observability Cloud Service Map, the main service appears as back-to-school-with-gen-ai. It talks to two different models: the embeddings model and gpt-4o-mini. On the Service Map, you also see the Chroma vector database, surfaced via its SQLite backend.

As soon as a user asks the soil-layers question, you can follow the request through a distributed trace:

[Screenshot: the distributed trace for the request]

That trace shows all the key AI-related interactions, and Splunk Observability Cloud decorates them with icons so they stand out:

  • An embeddings icon for embedding operations.
  • A vectordb icon for the similarity search in Chroma.
  • A chat icon for the LLM calls.

When you click into a vectordb span, you see the actual SQL query used to perform the similarity search:

[Screenshot: the SQL query shown in the vectordb span]

When you click into a chat span, you see metadata that is especially important for GenAI workloads, like how many tokens were used, what the cost was, and other parameters around that LLM call:

[Screenshot: token usage, cost, and call metadata in the chat span]

One detail Derek highlighted is that the SQL query shown in the Chroma span wasn’t written by him at all; it was generated by LangChain. Splunk Observability Cloud still exposes it, which means you can see and investigate queries and behavior coming from your frameworks, not just your own code.

The traces also show the original question and the chunks of textbook content retrieved as context:

[Screenshot: the original question and retrieved context in the trace]

And the final answer produced by the LLM using that context:

[Screenshot: the final LLM answer in the trace]

Now we not only have our question and expected response, but we also have a reliable way to gain valuable insight into our GenAI app so we can easily debug and explain what’s going on within it.

Troubleshooting and Optimizing

With visibility into things like token counts and latencies per request, optimization becomes a data-driven exercise instead of guesswork. In the session, Derek started from the initial request’s token count, which we saw earlier in Splunk Observability Cloud: about 925 tokens. To improve both cost and performance, he revisited the retrieve function in the code.

By default, LangChain’s similarity search returns four related documents. For a simple concept like soil layers, four full pages of context may be more than necessary. Derek experimented with the k parameter that controls how many documents to retrieve, dropping k from the default 4 down to 2:

[Code screenshot: dropping k from 4 to 2 in the retrieve function]

With that single change, the app:

  • Reduced token usage from around 925 tokens to about 583 tokens per request.
  • Improved latency from roughly 1.9 seconds to about 1.8 seconds.

[Screenshot: reduced token usage and latency after the change]

Those numbers might look small at first glance, but multiplied across a high volume of requests, the savings add up. Derek also emphasized that there is always a trade-off. For simple subjects like this one, two documents of context might be plenty. For more complex topics, you may need to raise k again to preserve answer quality.
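To see why small per-request savings matter, here’s a back-of-the-envelope calculation using the session’s token numbers; the traffic volume and per-token price are assumptions for illustration only, not quoted rates:

```python
# Per-request token counts from the session: ~925 before the change,
# ~583 after dropping k from 4 to 2.
tokens_before = 925
tokens_after = 583

requests_per_day = 1_000_000      # assumed traffic, for illustration
price_per_million_tokens = 0.15   # assumed $/1M input tokens

# Tokens saved per day, then converted to dollars at the assumed rate.
saved_tokens_per_day = (tokens_before - tokens_after) * requests_per_day
saved_dollars_per_day = saved_tokens_per_day / 1_000_000 * price_per_million_tokens

print(f"{saved_tokens_per_day:,} tokens/day saved, "
      f"about ${saved_dollars_per_day:.2f}/day at the assumed rate")
```

At a million requests a day, a 342-token reduction per request adds up quickly, which is exactly the kind of trade-off the trace data lets you evaluate before committing to a smaller k.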

The key point is that Splunk Observability Cloud gives you the visibility to make those decisions intentionally. You can experiment, look at traces and metrics, and then decide whether a change is worth it.

Extending the pattern to real enterprise data

In the Q&A, Derek and attendees stepped away from textbooks and talked about where this pattern fits in typical enterprise environments. Support workflows are an obvious candidate. Most organizations have years of support cases logged and stored. Those cases can be embedded and then stored in a vector database. When a new ticket comes in, an agent (or an assistant) can search for similar cases and quickly see how similar issues were resolved in the past.

The pattern is not limited to PDF formats. There are many different loaders available in common frameworks, including ones that pull content from corporate intranet sites that are not accessible on the public internet. It’s possible to use LangChain to scrape that internal content, embed it, and load it into a vector store so you can use it as grounding context for an assistant.

The same idea also applies to meeting recordings. With speech-to-text, you can convert recordings of team meetings into text, embed that text, and make it queryable through a RAG pipeline. That gives teams a way to ask questions about past decisions or discussions without manually searching through notes.

The message from the session was that once you have a reliable way to load data into a vector store and a clear observability story around the GenAI app, it becomes much easier to apply the same pattern to new domains.

On the observability side, questions came up about which version of the OpenTelemetry Collector to use. The guidance was straightforward: any modern Collector works, whether you run upstream OpenTelemetry or the Splunk Distribution. In practice, teams often pair that with additional instrumentation libraries and LangChain-specific instrumentation, which the Splunk team is working on bringing into the OpenTelemetry project.

Where to go from here

If you want to try this pattern yourself, a good starting point is to pick a narrow but meaningful use case — something like a single PDF runbook or a slice of internal documentation — and build a small RAG prototype on top of it.

From there, you can:

  • Use LangChain to handle embeddings, vector store integration (like Chroma), and prompt templates so you’re not reinventing primitives.
  • Introduce LangGraph to model your flow explicitly as a retrieve step followed by a generate step, and then expand as your use case grows.
  • Instrument early with the Splunk Distribution of OpenTelemetry Python, and deploy the OpenTelemetry Collector in the same environment as your app.
  • Send telemetry to Splunk Observability Cloud.
  • Experiment with parameters while watching token usage, cost, and latency; use traces and metrics to decide where to land.

If you’re not already using Splunk Observability Cloud, you can start with a free 14-day trial or work with your Splunk representative to get access and replicate a “back-to-school” style demo with your own content.


Want updates like this sent straight to you? Learn how to subscribe to this blog (and follow Labels you care about) in our quick guide. 
