Solved: How to compute stats (mean, median, mode) over a f...

jyzhang · ‎06-28-2011

I have a search which returns the result as frequency table:



uploads    frequency

  0           6

  1           4

  2           1

  5           1

Basically, 6 users have uploaded 0 times, 4 users uploaded 1 time, and so on.

Now I want to compute stats such as the mean, median, and mode. Is there an easier way to compute stats directly on a frequency table such as this? (ie mean should be 0.916666, and mode is 0) This seems to be something that is required quite often, yet I haven't found a way to do it.

I understand there's way to calculate stats on the data before I turn it into a frequency table, the problem is that the query is rather complex (because in order to include the upload == 0 case, I had to do a lot of extra work). In other words, the underlying logs have no events when there are no uploads, but to calculate accurate statistics, the 0 events must be taken into consideration.

gkanapathy · ‎06-28-2011

If you're frequency table is generated by you running stats count, and then appending the other items, you can do it most efficiently with:

source=mysourcedata 
| sistats mean(uploads) median(uploads) mode(uploads) by uploads
| append 
  [ stats count as psrsvd_sm_uploads
    | eval psrsvd_ct_uploads=      <number of entries with 0 uploads>
    | eval psrsvd_gc=psrsvf_ct_uploads
    | eval psrsvd_nc_uploads=psrsvd_ct_uploads
    | eval psrsvd_rd_uploads="0e+00"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_sm_uploads=0
    | eval psrsvd_ss_uploads=0
    | eval psrsvd_v=1
    | eval psrsvd_vm_uploads="0;"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_vt_uploads=0
    | eval uploads=0
 | stats mean(uploads) median(uploads) mode(uploads)

or, to get back the above frequency table:

source=mysourcedata 
| sistats mean(uploads) median(uploads) mode(uploads) by uploads
| append 
  [ stats count as psrsvd_sm_uploads
    | eval psrsvd_ct_uploads=      <number of entries with 0 uploads>
    | eval psrsvd_gc=psrsvf_ct_uploads
    | eval psrsvd_nc_uploads=psrsvd_ct_uploads
    | eval psrsvd_rd_uploads="0e+00"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_sm_uploads=0
    | eval psrsvd_ss_uploads=0
    | eval psrsvd_v=1
    | eval psrsvd_vm_uploads="0;"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_vt_uploads=0
    | eval uploads=0
 | stats count as frequency by uploads

if you want, you can generate the subsearch into a file with outputcsv or outputlookup, then fetch it back in and append it.

View solution in original post

gkanapathy · ‎06-28-2011

If you're frequency table is generated by you running stats count, and then appending the other items, you can do it most efficiently with:

source=mysourcedata 
| sistats mean(uploads) median(uploads) mode(uploads) by uploads
| append 
  [ stats count as psrsvd_sm_uploads
    | eval psrsvd_ct_uploads=      <number of entries with 0 uploads>
    | eval psrsvd_gc=psrsvf_ct_uploads
    | eval psrsvd_nc_uploads=psrsvd_ct_uploads
    | eval psrsvd_rd_uploads="0e+00"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_sm_uploads=0
    | eval psrsvd_ss_uploads=0
    | eval psrsvd_v=1
    | eval psrsvd_vm_uploads="0;"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_vt_uploads=0
    | eval uploads=0
 | stats mean(uploads) median(uploads) mode(uploads)

or, to get back the above frequency table:

source=mysourcedata 
| sistats mean(uploads) median(uploads) mode(uploads) by uploads
| append 
  [ stats count as psrsvd_sm_uploads
    | eval psrsvd_ct_uploads=      <number of entries with 0 uploads>
    | eval psrsvd_gc=psrsvf_ct_uploads
    | eval psrsvd_nc_uploads=psrsvd_ct_uploads
    | eval psrsvd_rd_uploads="0e+00"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_sm_uploads=0
    | eval psrsvd_ss_uploads=0
    | eval psrsvd_v=1
    | eval psrsvd_vm_uploads="0;"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_vt_uploads=0
    | eval uploads=0
 | stats count as frequency by uploads

if you want, you can generate the subsearch into a file with outputcsv or outputlookup, then fetch it back in and append it.

gkanapathy · ‎06-28-2011

Note the above will let you calculate mean(), stdev(), count(), and any percentile, along with a few other functions. Adding max(), min(), first(), last(), etc aren't that hard, but left as an exercise.

gkanapathy · ‎06-28-2011

No easy way to compute all of these in a single simple query.

mean:

... | stats 
        sum(uploads) as totaluploads 
        sum(frequency) as totalcount 
    | eval mean=totaluploads/totalcount

mode:

... | sort - frequency | head 1 | eval mode=uploads

median:

... | sort uploads 
    | eventstats sum(frequency) as totalcount | eval medianrank=floor((totalcount/2))
    | streamstats global=f current=t window=0
        sum(frequency) as maxrank
    | where (maxrank-frequency <= medianrank) AND (medianrank < frequency)
    | eval median = uploads

trickery to combine them:

... | sort - frequency 
    | eventstats first(uploads) as mode 
    | sort uploads 
    | eventstats 
        sum(frequency) as totalcount 
        sum(uploads) as totaluploads
    | eval mean=totaluploads/totalcount
    | eval medianrank=floor((totalcount/2))
    | streamstats current=t window=0
        sum(frequency) as maxrank
    | eval median=if((maxrank-frequency <= medianrank) AND (medianrank < frequency), uploads, null())
    | where isnotnull(median)

jyzhang · ‎06-28-2011

If that's the case, I'm thinking of writing my own custom command to expand the frequency table... this way I can pipe the results to stats and do mean(..), median(..), p90(..), etc. Unless something like this already exists...

How to compute stats (mean, median, mode) over a frequency table?

Enterprise Security Content Update (ESCU) | New Releases

Why am I not seeing the finding in Splunk Enterprise Security Analyst Queue?

Index This | What are the 12 Days of Splunk-mas?