Re: Summary index time grouping for performance

joebensimo · ‎09-18-2013

Is there a significant performance difference when searching a summary index of aggregate results (the output of the stats command) if all the events for a given hour are timestamped in the first second of that hour, compared with having them spread throughout the hour?

In other words, is there likely to be a significant performance difference in performing searches on a summary index created with

| stats
sum(a) as a
sum(b) as b
by date_hour x y z

(which puts all summary index rows/events at the start of each hour) or with

| stats
first(_time) as _time
sum(a) as a
sum(b) as b
by date_hour x y z

(which spreads summary index rows/events out across each hour)?

And in case I didn't make it clear above, I am concerned about the performance of searching the summary index; not generating it.

emotz · ‎09-19-2013

For summary indexing in general, the search that populates the summary index should use sistats rather than stats. The summary index will handle the time, so you don't need to group by date_hour and you don't need first(_time) either.
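As a minimal sketch of what the sistats suggestion might look like, assuming a scheduled saved search with summary indexing enabled: the base search `index=main sourcetype=my_data` and the saved search name `my_hourly_summary` are hypothetical placeholders, not taken from the thread.

```
index=main sourcetype=my_data
| sistats sum(a) sum(b) by x y z
```

The reporting search would then run the matching stats call against the summary index, here assumed to be named `summary`:

```
index=summary search_name="my_hourly_summary"
| stats sum(a) as a sum(b) as b by x y z
```

With this approach the reporting search generally has to use the same aggregations that the populating search passed to sistats.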

You could also just use report acceleration in Splunk 5.x, which makes this a whole lot simpler. Create your search, run it, save it, schedule it, and click the accelerate button; everything else is done for you in the background. You can then run the same search as normal over longer time ranges and get the answer quickly.

joebensimo · ‎09-19-2013

The summary index does not handle time the way I want. It aggregates by whatever time I tell it to -- or by the entire time range of the summary-index-generating query.

I group by time because I often need to group results by time periods shorter than the period over which the summary index query runs. E.g., I have summary-index-generating queries that run daily and generate data aggregated by hour.
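One way to sketch this kind of setup, a daily run that emits one row per hour, is to bucket _time explicitly with bin. This is an illustrative variant rather than the exact query used in this thread; the base search and the time range are assumptions.

```
index=main sourcetype=my_data earliest=-1d@d latest=@d
| bin _time span=1h
| stats sum(a) as a sum(b) as b by _time x y z
```

Here every output row carries the start of its hour as _time, so all summary rows for a given hour share one timestamp.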

Report acceleration doesn't work for my queries because of the calculations involved in building the summary indexes.

joebensimo · ‎09-19-2013

sistats limits what I can do with the summary index data when I query it. sistats doesn't give me the flexibility I need. Therefore, I use stats.

The main difference between sistats and stats is that sistats keeps/saves the minimum data needed to generate the specific statistics specified, while stats saves the results of each specified statistic.
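To illustrate the flexibility point, here is a hedged sketch of re-aggregating a stats-based summary at a coarser grouping than the one it was stored with. It relies on sums being additive; the index name `summary` and the saved search name `my_hourly_summary` are hypothetical.

```
index=summary search_name="my_hourly_summary"
| stats sum(a) as a sum(b) as b by x
```

Because the stored rows hold plain field values, any later search can treat them like ordinary events, although this only gives correct totals for statistics that can be combined this way (sums and counts, but not averages or distinct counts).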

joebensimo · ‎09-18-2013

Yes, I do use job inspector to learn about my query performance.

And the real test will be for me to try these two variations with my data and environment and see if there is a difference in performance.

I was hoping someone else might have tried this or have some theoretical explanation as to why one might be faster than the other (or explain why it will make no difference).

rturk · ‎09-18-2013

This isn't an answer per se, but have you tried using the job inspector to determine the efficiency of your searches?
