When searching summary index aggregate results (the output of a stats command) grouped by hour, is there a significant performance difference between having all summary index events in the first second of each hour and having them spread throughout the hour?
In other words, is there likely to be a significant performance difference between searching a summary index created with
sum(a) as a
sum(b) as b
by date_hour x y z
(which puts all summary index rows/events at the start of each hour) or with
first(_time) as _time
sum(a) as a
sum(b) as b
by date_hour x y z
(which spreads summary index rows/events out across each hour)?
In case I didn't make it clear above: I am concerned about the performance of searching the summary index, not of generating it.
For summary indexing in general, the search that populates the index should use sistats rather than stats. The summary index will handle the time, so you don't need to group by date_hour and you don't need the first(_time) either.
You could also just use report acceleration in Splunk 5.x to make this a whole lot simpler too. Create your search, run it, save it, schedule it and click on the accelerate button and everything will be done for you in the background. Then you can run your same search as normal over longer periods of time and get the answer quickly.
Summary index does not handle the time as I want. It aggregates by whatever time I tell it to -- or by the entire range of the summary-index-generating query.
I group by time because I often need to group results by time periods shorter than the period over which the summary-index-generating query runs. For example, I have summary-index-generating queries that run daily and produce data aggregated by hour.
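To make that concrete, here is a rough sketch of what such a daily populating search could look like. The index name (main), sourcetype (my_data), and summary index name (my_summary) are placeholders I made up for illustration, not my real configuration, and the fields a, b, x, y, z are the same stand-ins used in the question. This version uses bin to snap _time to the hour, so every summary row lands at the start of its hour, like the first variant in the question.

```
index=main sourcetype=my_data earliest=-1d@d latest=@d
| bin _time span=1h
| stats sum(a) as a sum(b) as b by _time x y z
| collect index=my_summary
```

A search over the summary can then re-aggregate at the hourly level or coarser, for example with stats sum(a) as a by _time x.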
Report acceleration doesn't work for my queries due to the calculations that are in the building of the summary indexes.
sistats limits what I can do with the summary index data when I query it. sistats doesn't give me the flexibility I need. Therefore, I use stats.
The main difference between sistats and stats is that sistats keeps/saves the minimum data needed to generate the specific statistics specified, while stats saves the results of each specified statistic.
Yes, I do use job inspector to learn about my query performance.
And the real test will be for me to try these two variations with my data and environment and see if there is a difference in performance.
I was hoping someone else might have tried this or have some theoretical explanation as to why one might be faster than the other (or explain why it will make no difference).