Let’s say I’m running a travel planning AI app in production. A user asks for three concise hotel options in Barcelona, and the app returns a long, padded response with extra details and suggestions nobody asked for. The request succeeded. No 500 error, no spike in latency. But it wasn’t a good answer.
If I were running a traditional service, my usual APM dashboards wouldn’t tell me much here — the request completed successfully. But AI workloads need a few extra signals: which agent handled the response, how many tokens it burned, which tools and models it called, whether retrieval pulled the right context, and whether the answer itself was any good.
In this post, we’ll walk through how to debug exactly that kind of scenario in Splunk Observability Cloud — from spotting the bad agent to setting up an alert so we don’t find out about the next one from a customer complaint.
Start with the AI agents overview
The AI agents overview page in Splunk APM is a great place to start. It rolls up the signals across every instrumented agent: request and error counts, P90 latency, input and output token totals, and quality scores like hallucination, toxicity, bias, relevance, and sentiment:
Here we can see the flight_specialist and hotel_specialist agents are both flagged Critical with 100% hallucinated responses. The app is responding, but the content is the problem. From here, we can click into individual agents to view related AI trace data.
Trace a bad response
The trace waterfall for one bad response is where the real work happens. Inside a single trace, we get every agent step, LLM call, and tool execution as its own span:
Here, the travel-planner request fans out through a LangGraph workflow into a coordinator agent and a flight_specialist. Each has its own chat spans (LLM calls) and an execute_tool span for the mock_search_flights tool. That’s already enough to answer the questions that matter when something goes sideways.
For our travel app, the trace might show that the hotel agent made three LLM calls instead of one, that a tool returned more results than the agent could handle, or that an unrelated trip was returned from the vector DB. The fix is different in each case, and without the trace tied to the agent, the cause is anyone’s guess. And because these LLM and tool spans live inside the same trace as the rest of the request, you’re not jumping between an APM tool and a separate AI dashboard. You can move from a user’s broader transaction straight into the specific model interaction in one place.
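To make the first two of those checks concrete, here is a minimal sketch that scans the spans of one trace for repeated LLM calls and oversized tool results. The span records, their field names, the mock_search_hotels tool, and the max_results limit are all assumptions made up for illustration; this is not the Splunk or OpenTelemetry export format.

```python
from collections import Counter

# Hypothetical, simplified span records for a single trace. The field names
# and the mock_search_hotels tool are assumptions, not a real export format.
SPANS = [
    {"agent": "coordinator", "operation": "chat"},
    {"agent": "hotel_specialist", "operation": "chat"},
    {"agent": "hotel_specialist", "operation": "execute_tool",
     "tool": "mock_search_hotels", "result_count": 40},
    {"agent": "hotel_specialist", "operation": "chat"},
    {"agent": "hotel_specialist", "operation": "chat"},
]


def llm_calls_per_agent(spans):
    """Count chat (LLM call) spans for each agent in the trace."""
    return Counter(s["agent"] for s in spans if s["operation"] == "chat")


def oversized_tool_results(spans, max_results=10):
    """Return (agent, tool, result_count) for tool spans above max_results."""
    return [
        (s["agent"], s["tool"], s["result_count"])
        for s in spans
        if s["operation"] == "execute_tool"
        and s.get("result_count", 0) > max_results
    ]


calls = llm_calls_per_agent(SPANS)
big_results = oversized_tool_results(SPANS)
print(dict(calls))
print(big_results)
```

In this made-up trace, the hotel agent shows three chat spans where we expected one, and its tool span returned far more results than the limit we picked. Those are two different fixes, which is the point of looking at the trace before guessing.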
Spot quality drift before users do
In a traditional service, a successful response usually means we’re done. With AI workloads, success isn’t the same as quality. The detailed overview for an agent surfaces the usual requests, errors, and latency, along with token usage and a quality issues timeline:
Hallucination, toxicity, bias, relevance, and sentiment scores live next to the performance charts. When fewer than 80% of evaluations pass for a given metric, the agent gets flagged with a quality issue so we can spot a model drifting before it shows up in customer feedback.
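The 80% rule is simple enough to sketch in a few lines. This is only an illustration of the threshold described above, not how Splunk computes it internally; the function names, the sample data, and the choice to skip metrics with no evaluations are assumptions.

```python
PASS_RATE_THRESHOLD = 0.80  # from the post: flagged when fewer than 80% pass


def pass_rate(results):
    """Fraction of evaluations that passed; results is a list of booleans."""
    if not results:
        return None  # assumption: no evaluations yet means nothing to flag
    return sum(results) / len(results)


def quality_issues(evals_by_metric, threshold=PASS_RATE_THRESHOLD):
    """Return {metric: pass_rate} for metrics whose pass rate is below threshold."""
    flagged = {}
    for metric, results in evals_by_metric.items():
        rate = pass_rate(results)
        if rate is not None and rate < threshold:
            flagged[metric] = rate
    return flagged


# Made-up evaluation outcomes for one agent (True means the evaluation passed).
evals = {
    "hallucination": [True, False, False, True, False],  # 2 of 5 pass
    "relevance": [True, True, True, True, False],         # 4 of 5 pass
    "toxicity": [True, True, True, True, True],
    "bias": [],
}
flagged = quality_issues(evals)
print(flagged)
```

Note that a metric sitting at exactly 80% is not flagged here, since the rule says "fewer than 80%".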
This drill-down is invaluable when a specific user issue comes up. But the goal is to catch the problem before a user ever runs into it. Because these are just metrics, we can put detectors on top of them like any other signal in Splunk Observability Cloud:
This detector fires when the hallucination evaluation rises above zero over a one-hour window. In production, we don’t want to find out about a quality regression from a customer complaint. Alerting on quality and risk signals lets us proactively catch the problem the same way we catch a 500 error spike.
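The rule behind that detector can be sketched in plain Python. This is not SignalFlow and not the actual detector definition; it assumes the signal is a count of hallucinated responses and that "over a one-hour window" means a trailing window ending now. The event data is invented.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)


def should_alert(events, now, window=WINDOW):
    """Return True if any hallucination was recorded in the trailing window.

    events is a list of (timestamp, hallucination_count) pairs.
    """
    start = now - window
    total = sum(count for ts, count in events if start < ts <= now)
    return total > 0


now = datetime(2025, 1, 1, 12, 0)
events = [
    (now - timedelta(minutes=90), 2),  # outside the window, ignored
    (now - timedelta(minutes=30), 0),  # inside the window, nothing flagged
    (now - timedelta(minutes=5), 1),   # inside the window, triggers the alert
]
print(should_alert(events, now))
```

A threshold of zero is deliberately strict. For a noisier metric we would probably alert on a rate or a sustained count instead.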
Get data flowing in
None of this requires throwing out the observability patterns we already know and love. OpenTelemetry is still the foundation:
OpenTelemetry has GenAI semantic conventions for spans covering inference, embeddings, retrievals, and tool execution, with attributes like operation name, provider, model, and tool name. That’s how the AI agents overview page and the trace filters we saw earlier know what to group on.
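As a rough illustration of what those attributes look like, the sketch below builds attribute dictionaries by hand and groups token usage by model. The keys follow the GenAI semantic conventions as I understand them at the time of writing; the conventions are still evolving (the provider key, for example, was gen_ai.system in earlier versions), so check names against the current spec. The helper functions, the provider value, and the model name are placeholders, and in a real app the instrumentation library sets these attributes for us.

```python
from collections import defaultdict


def genai_attributes(operation, provider=None, model=None, tool=None,
                     input_tokens=None, output_tokens=None):
    """Build a span attribute dict, skipping values that were not supplied."""
    candidates = {
        "gen_ai.operation.name": operation,
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": model,
        "gen_ai.tool.name": tool,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    return {key: value for key, value in candidates.items() if value is not None}


def tokens_by_model(span_attrs):
    """Sum input and output tokens per requested model across spans."""
    totals = defaultdict(int)
    for attrs in span_attrs:
        model = attrs.get("gen_ai.request.model")
        if model is None:
            continue  # tool spans carry no model in this sketch
        totals[model] += attrs.get("gen_ai.usage.input_tokens", 0)
        totals[model] += attrs.get("gen_ai.usage.output_tokens", 0)
    return dict(totals)


# Placeholder provider and model names, chosen only for the example.
span_attrs = [
    genai_attributes("chat", provider="openai", model="gpt-4o-mini",
                     input_tokens=120, output_tokens=480),
    genai_attributes("chat", provider="openai", model="gpt-4o-mini",
                     input_tokens=90, output_tokens=310),
    genai_attributes("execute_tool", tool="mock_search_flights"),
]
totals = tokens_by_model(span_attrs)
print(totals)
```

Because every span uses the same keys, a backend can group by operation, model, or tool without knowing anything about the framework that produced the spans.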
For Splunk AI Agent Monitoring, check out the setup docs for AI applications. Basic instrumentation gets you traces, tokens, and tool calls.
To also see the quality scores, the docs walk through enabling DeepEval and setting a few environment variables in your .env file, for example:
OTEL_INSTRUMENTATION_GENAI_EMITTERS=span_metric_event,splunk
OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true
OTEL_INSTRUMENTATION_GENAI_EVALS_RESULTS_AGGREGATION=true
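If quality scores don’t show up, one of the first things worth ruling out is a variable that never reached the process. Here is a small sanity check, assuming only the three variable names above; it checks that each one is set and says nothing about which values are valid. The helper name is made up, and a .env file still has to be loaded into the environment by your shell or a loader before this will see it.

```python
import os

REQUIRED_VARS = (
    "OTEL_INSTRUMENTATION_GENAI_EMITTERS",
    "OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT",
    "OTEL_INSTRUMENTATION_GENAI_EVALS_RESULTS_AGGREGATION",
)


def missing_vars(env=None):
    """Return the names from REQUIRED_VARS that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing evaluation settings:", ", ".join(missing))
    else:
        print("All evaluation settings are present.")
```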
Note: the Splunk Distribution of the OpenTelemetry Collector handles the host- or Kubernetes-side collection and forwards everything to Splunk Observability Cloud.
Try it yourself
Monitoring AI workloads isn’t a replacement for APM. It’s an extension of it. Latency, errors, throughput, and infrastructure health still matter. But now we also need to know whether the answer was useful, whether the workflow called the right tools, whether token usage spiked, and whether the interaction introduced a security or privacy risk.
If you’re already sending telemetry to Splunk Observability Cloud, instrument one of your AI agents, run a few requests through it, and see what shows up on the AI agents page. The trace data behind a single bad response will tell you a lot more than “the request completed.”
Don’t yet have a Splunk Observability Cloud account? Try it out free for 14 days.