Hi all,
I have a search that runs over event data from a website, covering a few weeks of data. It should return (among other things) the average number of pageviews per user and the standard deviation of this number. I want to create a summary index of this data. I am using two transforming commands to achieve this, and I replace the final one with its si- variant to create the summary index. However, the results are not the same as when I just run the original search (over the same time range) against the raw data.
The search looks like this:
event="pageview"
| rename ...
| eval Variant = ... , deviceGroup = ...
| stats count as pageview_per_user by userID, Variant, deviceGroup
| sistats dc(userID) as users, sum(pageview_per_user) as pageviews, avg(pageview_per_user) as avg_pv, stdev(pageview_per_user) as std_pv by Variant, deviceGroup
In particular, the total number of pageviews is way off when I use the summary index. The search that populates the summary index runs every hour, over the time range from 1 August until -1h@h.
What could be the problem here? Could it be the use of the two (si)stats commands?
By the way, I realize I could probably use report acceleration here, but I want to understand summary indexing better.
Best, Jacob
There is nothing wrong with two stats commands, as long as they are aggregating the information you want them to collect. Since the aggregation runs hourly across a longer time frame, you need to add _time to the first stats in order to collect valid comparison data.
Do this across your longer time frame and see how well it matches a pull from your summary for the same time frame.
event="pageview"
| rename ...
| eval Variant = ... , deviceGroup = ...
| bin _time span=1h
| stats count as pageview_per_user by userID, Variant, deviceGroup, _time
| stats count as users, sum(pageview_per_user) as pageviews, avg(pageview_per_user) as avg_pv, stdev(pageview_per_user) as std_pv by Variant, deviceGroup
or
| sistats dc(userID) as users, sum(pageview_per_user) as pageviews, avg(pageview_per_user) as avg_pv, stdev(pageview_per_user) as std_pv by Variant, deviceGroup
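To see why adding _time to the first stats changes what the second stats receives, here is a small Python sketch (not SPL). The events and the names `per_user` and `per_user_hour` are made up for illustration, and it only mimics the two-step aggregation; it is not how Splunk computes anything internally. The pageview sum is unaffected by the hourly bin, but the row count and the average are not, because a user who is active in two different hours contributes two rows.

```python
from collections import Counter
from statistics import mean

# Hypothetical pageview events as (hour, userID); user "a" is active in two hours.
events = [(0, "a"), (0, "a"), (1, "a"), (0, "b"), (1, "c"), (1, "c")]

# Roughly: stats count as pageview_per_user by userID (whole time range).
per_user = Counter(user for _, user in events)

# Roughly: bin _time span=1h | stats count as pageview_per_user by userID, _time.
per_user_hour = Counter((hour, user) for hour, user in events)

def summarize(counts):
    """Second aggregation step: row count, sum and average of the per-row counts."""
    values = list(counts.values())
    return {"rows": len(values), "pageviews": sum(values), "avg_pv": mean(values)}

whole_range = summarize(per_user)
hourly = summarize(per_user_hour)

print(whole_range)
print(hourly)
```

With these made-up events both versions report the same pageview total, but the hourly version has four rows instead of three users, so its average per row is lower. That is why the raw-data search and the summary only line up if both are built on the same hourly grain.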
It looks like you are doing stats over stats. Try to arrange the data according to your needs and use the | collect command to send the results to the summary index.
Read more here:
http://docs.splunk.com/Documentation/SplunkCloud/6.6.0/SearchReference/Collect
Thanks! Could you explain a little about why the stats over stats search I wrote gives the wrong results?
And should I just replace sistats with stats, pipe everything to collect, and schedule the search to run every hour? Should I still check "Enable" under summary indexing when I save the search?
Sorry for my ignorance.