Splunk Search

How to compute stats (mean, median, mode) over a frequency table?

jyzhang
Engager

I have a search which returns the result as frequency table:


uploads frequency
0 6
1 4
2 1
5 1

Basically, 6 users have uploaded 0 times, 4 users uploaded 1 time, and so on.

Now I want to compute stats such as the mean, median, and mode. Is there an easier way to compute stats directly on a frequency table such as this? (ie mean should be 0.916666, and mode is 0) This seems to be something that is required quite often, yet I haven't found a way to do it.

I understand there's way to calculate stats on the data before I turn it into a frequency table, the problem is that the query is rather complex (because in order to include the upload == 0 case, I had to do a lot of extra work). In other words, the underlying logs have no events when there are no uploads, but to calculate accurate statistics, the 0 events must be taken into consideration.

0 Karma
1 Solution

gkanapathy
Splunk Employee
Splunk Employee

If you're frequency table is generated by you running stats count, and then appending the other items, you can do it most efficiently with:

source=mysourcedata 
| sistats mean(uploads) median(uploads) mode(uploads) by uploads
| append 
  [ stats count as psrsvd_sm_uploads
    | eval psrsvd_ct_uploads=      <number of entries with 0 uploads>
    | eval psrsvd_gc=psrsvf_ct_uploads
    | eval psrsvd_nc_uploads=psrsvd_ct_uploads
    | eval psrsvd_rd_uploads="0e+00"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_sm_uploads=0
    | eval psrsvd_ss_uploads=0
    | eval psrsvd_v=1
    | eval psrsvd_vm_uploads="0;"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_vt_uploads=0
    | eval uploads=0
 | stats mean(uploads) median(uploads) mode(uploads)

or, to get back the above frequency table:

source=mysourcedata 
| sistats mean(uploads) median(uploads) mode(uploads) by uploads
| append 
  [ stats count as psrsvd_sm_uploads
    | eval psrsvd_ct_uploads=      <number of entries with 0 uploads>
    | eval psrsvd_gc=psrsvf_ct_uploads
    | eval psrsvd_nc_uploads=psrsvd_ct_uploads
    | eval psrsvd_rd_uploads="0e+00"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_sm_uploads=0
    | eval psrsvd_ss_uploads=0
    | eval psrsvd_v=1
    | eval psrsvd_vm_uploads="0;"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_vt_uploads=0
    | eval uploads=0
 | stats count as frequency by uploads

if you want, you can generate the subsearch into a file with outputcsv or outputlookup, then fetch it back in and append it.

View solution in original post

gkanapathy
Splunk Employee
Splunk Employee

If you're frequency table is generated by you running stats count, and then appending the other items, you can do it most efficiently with:

source=mysourcedata 
| sistats mean(uploads) median(uploads) mode(uploads) by uploads
| append 
  [ stats count as psrsvd_sm_uploads
    | eval psrsvd_ct_uploads=      <number of entries with 0 uploads>
    | eval psrsvd_gc=psrsvf_ct_uploads
    | eval psrsvd_nc_uploads=psrsvd_ct_uploads
    | eval psrsvd_rd_uploads="0e+00"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_sm_uploads=0
    | eval psrsvd_ss_uploads=0
    | eval psrsvd_v=1
    | eval psrsvd_vm_uploads="0;"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_vt_uploads=0
    | eval uploads=0
 | stats mean(uploads) median(uploads) mode(uploads)

or, to get back the above frequency table:

source=mysourcedata 
| sistats mean(uploads) median(uploads) mode(uploads) by uploads
| append 
  [ stats count as psrsvd_sm_uploads
    | eval psrsvd_ct_uploads=      <number of entries with 0 uploads>
    | eval psrsvd_gc=psrsvf_ct_uploads
    | eval psrsvd_nc_uploads=psrsvd_ct_uploads
    | eval psrsvd_rd_uploads="0e+00"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_sm_uploads=0
    | eval psrsvd_ss_uploads=0
    | eval psrsvd_v=1
    | eval psrsvd_vm_uploads="0;"+tostring(psrsvd_ct_uploads)+";"
    | eval psrsvd_vt_uploads=0
    | eval uploads=0
 | stats count as frequency by uploads

if you want, you can generate the subsearch into a file with outputcsv or outputlookup, then fetch it back in and append it.

gkanapathy
Splunk Employee
Splunk Employee

Note the above will let you calculate mean(), stdev(), count(), and any percentile, along with a few other functions. Adding max(), min(), first(), last(), etc aren't that hard, but left as an exercise.

0 Karma

gkanapathy
Splunk Employee
Splunk Employee

No easy way to compute all of these in a single simple query.

mean:

... | stats 
        sum(uploads) as totaluploads 
        sum(frequency) as totalcount 
    | eval mean=totaluploads/totalcount

mode:

... | sort - frequency | head 1 | eval mode=uploads

median:

... | sort uploads 
    | eventstats sum(frequency) as totalcount | eval medianrank=floor((totalcount/2))
    | streamstats global=f current=t window=0
        sum(frequency) as maxrank
    | where (maxrank-frequency <= medianrank) AND (medianrank < frequency)
    | eval median = uploads

trickery to combine them:

... | sort - frequency 
    | eventstats first(uploads) as mode 
    | sort uploads 
    | eventstats 
        sum(frequency) as totalcount 
        sum(uploads) as totaluploads
    | eval mean=totaluploads/totalcount
    | eval medianrank=floor((totalcount/2))
    | streamstats current=t window=0
        sum(frequency) as maxrank
    | eval median=if((maxrank-frequency <= medianrank) AND (medianrank < frequency), uploads, null())
    | where isnotnull(median)

jyzhang
Engager

If that's the case, I'm thinking of writing my own custom command to expand the frequency table... this way I can pipe the results to stats and do mean(..), median(..), p90(..), etc. Unless something like this already exists...

0 Karma
Get Updates on the Splunk Community!

Earn a $35 Gift Card for Answering our Splunk Admins & App Developer Survey

Survey for Splunk Admins and App Developers is open now! | Earn a $35 gift card!      Hello there,  Splunk ...

Continuing Innovation & New Integrations Unlock Full Stack Observability For Your ...

You’ve probably heard the latest about AppDynamics joining the Splunk Observability portfolio, deepening our ...

Monitoring Amazon Elastic Kubernetes Service (EKS)

As we’ve seen, integrating Kubernetes environments with Splunk Observability Cloud is a quick and easy way to ...