Hi,
We need to calculate mean, median, perc95 and stdev statistics for multiple fields from a huge log every day. Each search takes a long time, and we plan to use summary indexing to record the stats.
We have encountered a problem when trying to come up with the summary search syntax. We originally tried to use a search like this:
eventtype="Livehosts" sourcetype="WLSDealStats" dealToConfirm confirm NOT manualDealResponse | ua2os | geoip client_address | strcat lsHwPlatform "_" lsOperatingSystem "_" lsMaxDelayMillis lsConfigType | stats mean(totalTime) median(totalTime) perc95(totalTime) stdev(totalTime) mean(initialDealResponse) median(initialDealResponse) perc95(initialDealResponse) stdev(initialDealResponse) by lsConfigType web_site_id browser client_address_countryname
However, this produces quite a large table, containing a stats entry for each possible combination of all of the by clauses, eg.
lsConfigType web_site_id browser client_address_countryname mean(totalTime) …
G5_RH5_50 aum Chrome Europe 556 …
G5_RH5_50 aum Chrome France 637.8 …
G5_RH5_50 aum Chrome Netherlands 732.4 …
G5_RH5_50 aum Chrome United Kingdom 677.653846 …
G5_RH5_50 aum Firefox 1 Europe 2493.333333 …
G5_RH5_50 aum Firefox 1 France 616.857143 …
G5_RH5_50 aum Firefox 1 Netherlands 889 …
G5_RH5_50 aum Firefox 1 United Kingdom 1252.6 …
G5_RH5_50 aum Firefox 3 Netherlands 526.5 …
G5_RH5_50 aum Firefox 3 United Kingdom 571.75 …
It is not actually possible to work out the overall mean(totalTime) for say, the Chrome browser type from this information.
What we really need is a table containing only the overall stats for each of the specified fields, not each combination of all possible fields OR a way to run different post-processing over a data set multiple times (I know post processing is available within a dashboard, but remember this is for summary searching, not to mention it is limited to 10k results).
An example could be this, which is like four separate tables merged together:
mean(totalTime) median(totalTime)
web_site_id aum 123 234
web_site_id sgx 345 456
client_address_countryname Europe etc …
client_address_countryname France … …
client_address_countryname United Kingdom … …
browser Chrome … …
browser Firefox 1 … …
browser Firefox 3 … …
We can run the search four times, each time specifying a single by clause only, but this seems inefficient as we have to search over the same (large) data set four times just to do different stats calculations. Also it is not scalable - every time someone wants to report on a new field, we have to create new searches for it rather than just add the field to a single search.
In the context of 4.3, you should try using 'sistats' to prepare the data for summary indexing. This command carries enough data to view the individual "sides" of your data cube in isolation. The gist is <your_search> | sistats <stats_functions> by <differentiators>
. It's true that it will be "gross" when looked at in a table view, but when you retrieve those summary events and pipe them to a "reheat" stats command, the latter can focus on the particular stats it wants to summarize, like your mean(totalTime) above.
To focus on (say) a browser, the search for summary records could start with browser=chrome index=summary search_name="<your_search>"
to focus on just Chrome. As you then pipe that to stats to get the mean, it'll be the mean for Chrome across the whole set.
In 5.0, the need for building a large data cube like this are obviated by report acceleration. With this feature, a simpler view of the stats (i.e. focus on just mean) will summarize this data directly to the index in which the events are resident, rather than a summary index. With report acceleration, a data cube of this form is less efficient than targeted stats aggregations. If 5.0 is on your horizon, you might try that instead of the big summary index you've described!
Glenn, my question is totally unrelated to your above post. But, I noticed you have ua2os in your search query. How did you configure it to get it working? I really need this app to work in my Splunk setup.