I have a search which returns the result as a frequency table:
uploads frequency
0 6
1 4
2 1
5 1
Now I want to compute stats such as the mean, median, and mode. Is there an easier way to compute stats directly on a frequency table such as this? (i.e. the mean should be 0.916666 and the mode 0.) This seems to be something that is required quite often, yet I haven't found a way to do it.
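(To spell out the arithmetic behind those expected values: each uploads value is weighted by its frequency, so mean = (0*6 + 1*4 + 2*1 + 5*1) / (6 + 4 + 1 + 1) = 11/12 ≈ 0.9167, and mode = 0 because 0 is the value with the largest frequency, 6.)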
I understand there's a way to calculate stats on the data before I turn it into a frequency table; the problem is that the query is rather complex (in order to include the uploads == 0 case, I had to do a lot of extra work). In other words, the underlying logs have no events when there are no uploads, but to calculate accurate statistics, the 0 events must be taken into consideration.
If your frequency table is generated by running stats count and then appending the other items, you can do it most efficiently with:
source=mysourcedata
| sistats mean(uploads) median(uploads) mode(uploads) by uploads
| append
[ stats count as psrsvd_sm_uploads
| eval psrsvd_ct_uploads= <number of entries with 0 uploads>
| eval psrsvd_gc=psrsvd_ct_uploads
| eval psrsvd_nc_uploads=psrsvd_ct_uploads
| eval psrsvd_rd_uploads="0e+00"+tostring(psrsvd_ct_uploads)+";"
| eval psrsvd_sm_uploads=0
| eval psrsvd_ss_uploads=0
| eval psrsvd_v=1
| eval psrsvd_vm_uploads="0;"+tostring(psrsvd_ct_uploads)+";"
| eval psrsvd_vt_uploads=0
| eval uploads=0 ]
| stats mean(uploads) median(uploads) mode(uploads)
Or, to get back the above frequency table:
source=mysourcedata
| sistats mean(uploads) median(uploads) mode(uploads) by uploads
| append
[ stats count as psrsvd_sm_uploads
| eval psrsvd_ct_uploads= <number of entries with 0 uploads>
| eval psrsvd_gc=psrsvd_ct_uploads
| eval psrsvd_nc_uploads=psrsvd_ct_uploads
| eval psrsvd_rd_uploads="0e+00"+tostring(psrsvd_ct_uploads)+";"
| eval psrsvd_sm_uploads=0
| eval psrsvd_ss_uploads=0
| eval psrsvd_v=1
| eval psrsvd_vm_uploads="0;"+tostring(psrsvd_ct_uploads)+";"
| eval psrsvd_vt_uploads=0
| eval uploads=0 ]
| stats count as frequency by uploads
If you want, you can generate the subsearch results into a file with outputcsv or outputlookup, then fetch them back in and append them.
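For example, a minimal sketch of that round-trip, assuming the subsearch above was saved once with | outputlookup zero_uploads_prestats.csv (the lookup file name is just an example):
source=mysourcedata
| sistats mean(uploads) median(uploads) mode(uploads) by uploads
| append [| inputlookup zero_uploads_prestats.csv ]
| stats mean(uploads) median(uploads) mode(uploads)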
Note that the above will let you calculate mean(), stdev(), count(), and any percentile, along with a few other functions. Adding max(), min(), first(), last(), etc. isn't that hard, but is left as an exercise.
No easy way to compute all of these in a single simple query.
mean (weighting each uploads value by its frequency):
... | eval weighted=uploads*frequency
| stats
sum(weighted) as totaluploads
sum(frequency) as totalcount
| eval mean=totaluploads/totalcount
mode:
... | sort - frequency | head 1 | eval mode=uploads
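(On the sample table this keeps the uploads=0 row, whose frequency of 6 is the largest, so mode=0. If two values tie for the highest frequency, head 1 keeps whichever happens to sort first.)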
median:
... | sort uploads
| eventstats sum(frequency) as totalcount | eval medianrank=floor((totalcount/2))
| streamstats global=f current=t window=0
sum(frequency) as maxrank
| where (maxrank-frequency <= medianrank) AND (medianrank < maxrank)
| eval median = uploads
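Walking through the sample table: after sorting by uploads the running sums (maxrank) are 6, 10, 11, 12, and medianrank = floor(12/2) = 6. The uploads=1 row is the first one satisfying maxrank-frequency <= 6 < maxrank (10-4 = 6 <= 6 < 10), so it is kept as the median. For an even number of events this picks the upper of the two middle values; averaging the two middle values (which would give 0.5 here) would need extra handling.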
trickery to combine them:
... | sort - frequency
| eventstats first(uploads) as mode
| sort uploads
| eval weighted=uploads*frequency
| eventstats
sum(frequency) as totalcount
sum(weighted) as totaluploads
| eval mean=totaluploads/totalcount
| eval medianrank=floor((totalcount/2))
| streamstats current=t window=0
sum(frequency) as maxrank
| eval median=if((maxrank-frequency <= medianrank) AND (medianrank < maxrank), uploads, null())
| where isnotnull(median)
If that's the case, I'm thinking of writing my own custom command to expand the frequency table... this way I can pipe the results to stats and do mean(..), median(..), p90(..), etc. Unless something like this already exists...
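For reference, here's roughly what I mean, sketched with the built-in multivalue commands instead of a custom command (untested, and it multiplies the number of rows by the frequencies, so it won't scale to very large counts):
... your frequency table search ...
| eval n=mvrange(0, frequency)
| mvexpand n
| fields - n
| stats mean(uploads) median(uploads) mode(uploads) p90(uploads)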