We have been having some strange performance issues with some of our dashboards and we would like some advice on how to troubleshoot these issues and fix them. Despite the underlying searches being extremely fast, sometimes results will take upwards of 30 seconds to be displayed in the corresponding dashboard panels.

Infrastructure and dashboard details

We are running a distributed on-prem Splunk environment with one search head and a cluster of three indexers. All instances are on version 9.2.2, although we have been able to replicate these issues with a 9.4.2 search head as well.

We have six core dashboards, ranging from simple and static to considerably dynamic and complex. About 95% of the searches in this app's dashboards are metric-based and use mstats. Each individual search is quite fast, with most searches running in under 0.5s, even in the presence of joins/appends. Most of these searches have a 10s refresh time by default.

Problem

We have been facing a recurring issue where certain panels will sometimes not load for several seconds (10-30 seconds usually). This tends to happen in some of the more complex dashboards, particularly after drilldowns/input interactions; doing so often leads to "Waiting for data" messages displayed inside the panels. One of two things tends to happen:

1. The underlying search jobs run successfully but the panels do not display data until the next refresh, which causes the search to re-run; panels behave as normal afterwards.
2. The pending searches start executing but do not fetch any results for several seconds, which can lead to the same search taking variable amounts of time to execute.
Here is an example of the same search taking significantly different amounts of time to run (the two runs were just 27s apart):

Whenever a search is slow, the component that takes by far the longest is dispatch.stream.remote.<one_of_the_indexers>. To the best of our knowledge, this represents the time the search head spends waiting for data streamed back from an indexer during a distributed search.

We have run load tests that consisted of opening our dashboards in several tabs simultaneously for prolonged periods of time while monitoring system metrics such as CPU, network, and memory. We were not able to detect any hardware bottlenecks, only a modest increase in CPU usage and load average on the search head and indexers, which is expected. We have also upgraded the hardware the search head is running on (now 96 cores, 512 GB RAM). Despite a noticeable performance increase, the problem still occurs occasionally.

We would greatly appreciate the community's assistance in helping us troubleshoot these issues.
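In case it helps anyone reproduce the comparison, below is a minimal Python sketch of how the per-indexer dispatch.stream.remote timings could be pulled out of a job's execution costs. It assumes the costs are already available as a dict keyed by component name, each with a duration_secs field (the shape we would expect from the performance section of a job's JSON properties). The function names, the indexer names idx1 to idx3, and all numbers are hypothetical and for illustration only.

```python
from typing import Dict, List, Tuple

# Prefix of the per-indexer components discussed above.
REMOTE_PREFIX = "dispatch.stream.remote."


def slowest_components(performance: Dict[str, dict], top: int = 3) -> List[Tuple[str, float]]:
    """Return the `top` components with the largest duration_secs, slowest first."""
    timings = [
        (name, float(stats.get("duration_secs", 0.0)))
        for name, stats in performance.items()
        if isinstance(stats, dict)
    ]
    return sorted(timings, key=lambda item: item[1], reverse=True)[:top]


def remote_wait_by_indexer(performance: Dict[str, dict]) -> Dict[str, float]:
    """Map each indexer name to the time spent in its dispatch.stream.remote component."""
    return {
        name[len(REMOTE_PREFIX):]: float(stats.get("duration_secs", 0.0))
        for name, stats in performance.items()
        if name.startswith(REMOTE_PREFIX) and isinstance(stats, dict)
    }


# Hypothetical execution costs for one slow run (made-up numbers, not real measurements).
sample = {
    "dispatch.stream.remote": {"duration_secs": 12.4, "invocations": 6},
    "dispatch.stream.remote.idx1": {"duration_secs": 0.2, "invocations": 2},
    "dispatch.stream.remote.idx2": {"duration_secs": 12.1, "invocations": 2},
    "dispatch.stream.remote.idx3": {"duration_secs": 0.1, "invocations": 2},
    "dispatch.fetch": {"duration_secs": 12.5, "invocations": 7},
}

if __name__ == "__main__":
    for indexer, secs in sorted(remote_wait_by_indexer(sample).items()):
        print(f"{indexer}: {secs:.2f}s")
    print(slowest_components(sample, top=2))
```

Running this against the costs of a fast run and a slow run of the same search would show whether the extra time is concentrated on one indexer or spread across all three.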