Dashboards & Visualizations

Transient, inconsistent slowness in dashboard panels

tlopes
Engager

We have been seeing intermittent performance issues with some of our dashboards and would appreciate advice on how to troubleshoot and fix them. Although the underlying searches are extremely fast, results sometimes take upwards of 30 seconds to appear in the corresponding dashboard panels.

Infrastructure and dashboard details

We are running a distributed on-prem Splunk environment with one search head and a cluster of three indexers. All instances are on version 9.2.2, although we have been able to replicate these issues with a 9.4.2 search head as well.

We have six core dashboards, ranging from simple and static to considerably dynamic and complex. About 95% of the searches in this app’s dashboards are metric-based and use mstats. Each individual search is quite fast, with most searches running in under 0.5s, even in the presence of joins/appends. Most of these searches have a 10s refresh time by default.
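For illustration, a typical panel follows roughly the Simple XML pattern below; the index, metric, and dimension names are placeholders, not our actual configuration:

    <dashboard version="1.1">
      <label>Example metrics dashboard (illustrative)</label>
      <row>
        <panel>
          <title>Average CPU usage by host</title>
          <chart>
            <search>
              <query>| mstats avg(cpu.usage) WHERE index=example_metrics span=1m BY host</query>
              <earliest>-15m</earliest>
              <latest>now</latest>
              <refresh>10s</refresh>
              <refreshType>delay</refreshType>
            </search>
            <option name="charting.chart">line</option>
          </chart>
        </panel>
      </row>
    </dashboard>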

Problem

We have been facing a recurring issue where certain panels sometimes do not load for several seconds (usually 10-30 seconds). This tends to happen in some of the more complex dashboards, particularly after drilldowns or input interactions, which often leaves "Waiting for data" messages displayed inside the panels.

One of two things tends to happen:

  • The underlying search jobs run successfully but the panels do not display data until the next refresh, which causes the search to re-run; panels behave as normal afterwards:
  [Screenshot: tlopes_1-1752664626810.png]
  • The pending searches start executing but do not return any results for several seconds, so the same search can take very different amounts of time to complete. Here is an example of the same search, run twice just 27 seconds apart, with significantly different run times:
  [Screenshot: tlopes_2-1752664626811.png]

Whenever a search takes long to run, the component that dominates the execution cost is, by far, dispatch.stream.remote.<one_of_the_indexers>, which, to the best of our knowledge, represents the time the search head spends waiting for data streamed back from an indexer during a distributed search.
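Per-component timings like these are visible in the Job Inspector's execution costs. As a rough sketch of how the run-time variability can be quantified (field names are the standard audit-log fields; the multiplier threshold is arbitrary), something like the following surfaces searches whose duration swings widely between runs:

    index=_audit action=search info=completed
    | stats count min(total_run_time) AS fastest max(total_run_time) AS slowest avg(total_run_time) AS average BY search
    | where count > 1 AND slowest > 10*fastest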

We have run load tests that open our dashboards in several tabs simultaneously for prolonged periods while monitoring system metrics such as CPU, network, and memory. We did not detect any hardware bottlenecks, only a modest and expected increase in CPU usage and load average on the search head and indexers. We have also upgraded the search head hardware (96 cores, 512 GB RAM), and despite a noticeable performance increase, the problem still occurs occasionally.

We would greatly appreciate the community's assistance in helping us troubleshoot these issues.


richgalloway
SplunkTrust

Ten seconds is far too often to refresh a dashboard.  Unless you have an automaton monitoring the dashboard and taking action on what it finds, 5 minutes is more reasonable.  
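For example, in Simple XML the change is just the refresh interval in each panel's search (a sketch with placeholder values):

    <search>
      <query>| mstats avg(cpu.usage) WHERE index=example_metrics span=1m BY host</query>
      <refresh>5m</refresh>
      <refreshType>delay</refreshType>
    </search>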

It sounds like the dashboard is occasionally encountering periods when there are too many other searches running so it has to wait for resources.  There is little a dashboard can do about that (other than not contributing to the problem by refreshing too frequently).  
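To check whether that is happening, a rough picture of search activity over time can be pulled from the audit index (dispatch counts are only a proxy for concurrency, but spikes that line up with the slow periods are telling):

    index=_audit action=search info=granted
    | timechart span=1m count AS searches_started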

If multiple users are accessing the dashboard at the same time, consider replacing in-line searches with scheduled searches that are invoked by the dashboard using the loadjob or savedsearch command.  That will replace multiple executions of the same query with a single execution and each user will have the same view of the data.
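For example, assuming a scheduled search named "Example Dashboard Search" in an app "my_dashboards" owned by admin (all placeholder names), the panel's inline query becomes:

    <search>
      <query>| loadjob savedsearch="admin:my_dashboards:Example Dashboard Search"</query>
      <refresh>5m</refresh>
      <refreshType>delay</refreshType>
    </search>

Schedule the saved search at least as often as the dashboard refreshes so loadjob always finds recent results.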

---
If this reply helps you, Karma would be appreciated.