I have a scheduled report that runs at minute 05 of every hour and reports on events seen in the previous full hour (one complete hour of data), while the search itself runs over All Time (required for the stats). The report is accelerated over All Time. The scheduled report has the highest scheduler priority, does not run concurrently with other scheduled reports, and has no schedule window. After the initial summary (acceleration) is built, the job completes in about 830 seconds.
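For reference, the relevant scheduling and acceleration settings look roughly like the following savedsearches.conf stanza (the stanza name is made up and the exact keys are from memory, so treat this as an illustration rather than a copy of my actual config):

[Hourly Event Report]
# run at 5 minutes past every hour
cron_schedule = 5 * * * *
# search over All Time so the stats can see full history
dispatch.earliest_time = 0
dispatch.latest_time = now
# highest scheduler priority, no schedule window
schedule_priority = highest
schedule_window = 0
# report acceleration with an All Time summary range
auto_summarize = 1
auto_summarize.dispatch.earliest_time = 0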
However, every subsequent report experiences a time creep of a few seconds and never returns to the initial value.
Example report generation times from each hourly run of the scheduled report:
Duration: 831.608
Duration: 836.902
Duration: 850.259
Duration: 854.068
Duration: 875.153
Duration: 902.589
etc.
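One way to chart this trend directly (the savedsearch_name value below is just a placeholder for the real report name) is to query the scheduler's own log for the run time of each invocation:

index=_internal sourcetype=scheduler savedsearch_name="My Hourly Report"
| table _time status run_time
| sort _time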
I've confirmed the Report Acceleration Summary for this report is functioning correctly and updates every 10 minutes per the default setting.
Given this trend, Splunk will eventually take longer than an hour to generate the report. How do I prevent this from happening?
Update:
After inspecting each job of the hourly report, the time creep appears to be happening in command.stats.execute_output.merge_stats. The durations of these components are also steadily increasing every time the scheduled report runs:
dispatch.fetch.rcp.phase_0
dispatch.localSearch
dispatch.stream.local
Hi orion44,
this is just a hunch to explain your observation, and you should verify it locally: with every scheduled search, Splunk writes the results of the scheduled search to disk under /opt/splunk/var/run/splunk/dispatch/... on Unix (on Windows it will be under "Program Files\Splunk\..."). I assume that a search running for 900 seconds every hour produces a rather large result set. This is written to disk, and my hunch is that the true bottleneck is disk throughput plus the slowdown caused by that directory filling up. Try cleaning the search artifacts with /opt/splunk/bin/splunk clean-dispatch <some-target-directory-that-the-artifacts-are-moved-to> -1s
and check the execution time after removing the old artifacts.
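For example, on a default Linux install, something along these lines should do it (the target directory is only an example destination; the -1s time modifier means anything older than one second, i.e. effectively all existing artifacts, gets moved out):

/opt/splunk/bin/splunk clean-dispatch /tmp/old-dispatch-artifacts -1s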
Oliver
/opt/splunk/var/run/splunk/dispatch/ has 94 files in it, for a total of 267 MB. The files for each hourly scheduled search are around 300 KB each. Would this be regarded as a large amount?
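For anyone wanting to check their own instance, the count and total size can be read with ordinary shell commands on the search head, for example:

# total size of the dispatch directory
du -sh /opt/splunk/var/run/splunk/dispatch/
# number of artifact entries
ls /opt/splunk/var/run/splunk/dispatch/ | wc -l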
Hi orion44, no. If it were somewhere in the GB range, we might be onto something. As it is, this should not affect performance. Still, you could clear the artifacts and see if it makes any difference.
I would suggest using the "tstats" command (it performs statistical queries on indexed fields in tsidx files). Since you are running your queries on data models, and data models create the tsidx files, I believe this can be an alternative to speed up the results of your reports. You will have to rebuild your report, though, because tstats has to be the first command in the query.
check this link for tstats-> https://docs.splunk.com/Documentation/Splunk/7.3.1/SearchReference/Tstats
Here is a link with other commands that you can try as well -> https://docs.splunk.com/Documentation/Splunk/7.3.1/SearchReference/Commandsbycategory#Reports
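To give a rough idea of the shape of such a rewrite (the data model name Web and the field names below are only placeholders, not your actual model), it would start something like:

| tstats summariesonly=true count min(_time) as firstSeen max(_time) as lastSeen from datamodel=Web by Web.clientip Web.useragent

The summariesonly=true argument restricts tstats to data that has already been accelerated into tsidx summaries, which is where the speedup comes from.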
When you inspect the job and look at the execution costs job over job, is there one section where you are seeing the increase? I suggest looking at the last three jobs and comparing their execution costs to see where the time is being spent and where the increase shows up between them. You can get to the inspection page by appending your sid to the end of this URL, replacing the domain, app, and mysid with the correct domain, app, and search id: https://mysplunk.com/en-US/manager/app/job_inspector?sid=mysid
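For example, if the search head is mysplunk.com, the report lives in the search app, and the scheduled job's sid is scheduler__admin__search__RMD5abc123_at_1565000000_123 (all placeholder values), the URL would be:

https://mysplunk.com/en-US/manager/search/job_inspector?sid=scheduler__admin__search__RMD5abc123_at_1565000000_123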
Thanks, it appears the most time creep is caused by command.stats.execute_output.merge_stats. I'm not sure exactly what this does, but it's slowly killing the scheduled report like global warming. Shouldn't the report's acceleration be dealing with anything stats-related?
At this point I would ask to see the query, as I have an idea of what that could mean, but seeing the query will ensure I am barking up the right tree.
Here's the "heaviest" part of the query. Field names have been obfuscated:
| stats earliest(_time) as FirstSeen, latest(_time) as LastSeen count values by clientip useragent fieldX fieldY fieldZ
| where LastSeen>=relative_time(now(), "-1h@h") AND LastSeen<=relative_time(now(), "@h")
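(For readability: with the obfuscated values() argument written out as a placeholder field fieldW, that section is shaped like

| stats earliest(_time) as FirstSeen, latest(_time) as LastSeen, count, values(fieldW) as fieldW by clientip useragent fieldX fieldY fieldZ
| where LastSeen>=relative_time(now(), "-1h@h") AND LastSeen<=relative_time(now(), "@h")

i.e. the where clause on LastSeen is what narrows the All Time search down to entities last seen in the previous full hour.)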