Is there a way to produce overall stats "by" multi...

Glenn · ‎02-15-2012

Hi,

We need to calculate mean, median, perc95 and stdev statistics for multiple fields from a huge log every day. Each search takes a long time, and we plan to use summary indexing to record the stats.

We have encountered a problem when trying to come up with the summary search syntax. We originally tried to use a search like this:

eventtype="Livehosts" sourcetype="WLSDealStats" dealToConfirm confirm NOT manualDealResponse | ua2os | geoip client_address | strcat lsHwPlatform "_" lsOperatingSystem "_" lsMaxDelayMillis lsConfigType | stats mean(totalTime) median(totalTime) perc95(totalTime) stdev(totalTime) mean(initialDealResponse) median(initialDealResponse) perc95(initialDealResponse) stdev(initialDealResponse) by lsConfigType web_site_id browser client_address_countryname

However, this produces quite a large table, containing a stats entry for each possible combination of all of the by clauses, eg.

lsConfigType    web_site_id browser         client_address_countryname  mean(totalTime) …
G5_RH5_50       aum         Chrome          Europe                          556         …
G5_RH5_50       aum         Chrome          France                          637.8           …
G5_RH5_50       aum         Chrome          Netherlands                 732.4           …
G5_RH5_50       aum         Chrome          United Kingdom                  677.653846  …
G5_RH5_50       aum         Firefox 1       Europe                          2493.333333 …
G5_RH5_50       aum         Firefox 1       France                          616.857143  …
G5_RH5_50       aum         Firefox 1       Netherlands                 889         …
G5_RH5_50       aum         Firefox 1       United Kingdom                  1252.6          …
G5_RH5_50       aum         Firefox 3       Netherlands                 526.5           …
G5_RH5_50       aum         Firefox 3       United Kingdom                  571.75          …

It is not actually possible to work out the overall mean(totalTime) for say, the Chrome browser type from this information.

What we really need is a table containing only the overall stats for each of the specified fields, not each combination of all possible fields OR a way to run different post-processing over a data set multiple times (I know post processing is available within a dashboard, but remember this is for summary searching, not to mention it is limited to 10k results).

An example could be this, which is like four separate tables merged together:

                                            mean(totalTime) median(totalTime)
web_site_id                 aum         123             234
web_site_id                 sgx             345             456
client_address_countryname  Europe          etc         …
client_address_countryname  France          …         …
client_address_countryname  United Kingdom  …         …
browser                 Chrome          …         …
browser                 Firefox 1   …         …
browser                 Firefox 3   …         …

We can run the search four times, each time specifying a single by clause only, but this seems inefficient as we have to search over the same (large) data set four times just to do different stats calculations. Also it is not scalable - every time someone wants to report on a new field, we have to create new searches for it rather than just add the field to a single search.

sowings · ‎11-14-2012

In the context of 4.3, you should try using 'sistats' to prepare the data for summary indexing. This command carries enough data to view the individual "sides" of your data cube in isolation. The gist is <your_search> | sistats <stats_functions> by <differentiators>. It's true that it will be "gross" when looked at in a table view, but when you retrieve those summary events and pipe them to a "reheat" stats command, the latter can focus on the particular stats it wants to summarize, like your mean(totalTime) above.

To focus on (say) a browser, the search for summary records could start with browser=chrome index=summary search_name="<your_search>" to focus on just Chrome. As you then pipe that to stats to get the mean, it'll be the mean for Chrome across the whole set.

In 5.0, the need for building a large data cube like this are obviated by report acceleration. With this feature, a simpler view of the stats (i.e. focus on just mean) will summarize this data directly to the index in which the events are resident, rather than a summary index. With report acceleration, a data cube of this form is less efficient than targeted stats aggregations. If 5.0 is on your horizon, you might try that instead of the big summary index you've described!

ten_yard_fight · ‎11-14-2012

Glenn, my question is totally unrelated to your above post. But, I noticed you have ua2os in your search query. How did you configure it to get it working? I really need this app to work in my Splunk setup.

Is there a way to produce overall stats "by" multiple fields, without having to run the search once per field, or run multiple post-process stats commands over the same resultset?

Data Persistence in the OpenTelemetry Collector

Introducing Splunk 10.0: Smarter, Faster, and More Powerful Than Ever

Community Content Calendar, September edition

Are you a member of the Splunk Community?

Is there a way to produce overall stats "by" multiple fields, without having to run the search once per field, or run multiple post-process stats commands over the same resultset?

Data Persistence in the OpenTelemetry Collector

Introducing Splunk 10.0: Smarter, Faster, and More Powerful Than Ever

Community Content Calendar, September edition