Splunk Search

How to do stats or top for each column in a table?

Engager

Hello, let's see if someone can help with this 🙂

I have 4 fields, 3 of which I would like to have sorted and counted in relation to the first one, and then display the top 3 for each.

For example, I have the fields:
category
subcategory
product_id
referrer_url

And I want to display the top 20 categories and, for each category, the top 3 subcategories, top 3 product_id and top 3 referrer_url values (each with its count).

So far I've come up with this, which is close, but I still have to figure out how to sort and count 3 of the columns.

<my_search>
| table category subcategory product_id referrer_url
| stats count(category) AS Count list(subcategory) AS SubCat list(product_id) AS PID list(referrer_url) AS URL by category
| sort - Count
| head 20
| eval SubCat=mvindex(SubCat, 0, 2) | eval PID=mvindex(PID, 0, 2) | eval URL=mvindex(URL, 0, 2)

That's close, but I want SubCat, PID and URL sorted and counted (top would do it, but it seems it can't be used inside a stats search).

The expected output would be something like this:

(screenshot of the expected statistics view)

So: 20 categories, then for each one, the top 3 values of each column with their counts.

Thanks!

1 Solution

Influencer

I'm going to start with my solution, and then explain things:

<my_search> | table category subcategory product_id referrer_url | eval stats="f"
| appendpipe [ top limit=20 showperc=f category | eval stats="a"]
| appendpipe [ where stats="f" 
             | top limit=3 showperc=f subcategory by category 
             | rename count as subcat_count subcategory as subcat 
             | stats list(*) as * by category | foreach * [eval <<FIELD>>=mvjoin('<<FIELD>>',"***")] 
             | eval stats="b" ]
| appendpipe [ where stats="f"
             | top limit=3 showperc=f product_id by category
             | rename count as pid_count product_id as pid 
             | stats list(*) as * by category
             | foreach * [eval <<FIELD>>=mvjoin('<<FIELD>>',"***")]
             | eval stats="c" ]
| appendpipe [ where stats="f" 
             | top limit=3 showperc=f referrer_url by category 
             | rename count as url_count referrer_url as url 
             | stats list(*) as * by category
             | foreach * [eval <<FIELD>>=mvjoin('<<FIELD>>',"***")]
             | eval stats="d" ]
| where stats!="f" | stats first(*) as * by category 
| where stats="a" | fields - stats | sort - count | foreach * [eval <<FIELD>>=split('<<FIELD>>',"***")]

Pretty crazy, n'est-ce pas?

The first line gathers the data we're reporting on from your search and labels it as raw data (stats="f"). The table command is indeed necessary: we're going to be duplicating results quite a bit, and we don't want to carry the baggage of unnecessary fields beyond this point. (In my test environment, using numbers for each of these fields and 86,400 random events in a very dense search, the search took ~3 seconds with table here and ~28 seconds without it.)

The first appendpipe takes that raw data and uses top to gather the top 20 categories and their counts; we label these results stats="a" and add them to the result set.

In the next three appendpipes, we work only with the raw data (where stats="f") and again use top to find the top 3 values of each field. We rename the fields to what we want in the final results, then use stats to gather the 3 results for each category into a single result with multivalued fields. Finally, foreach turns each of those multivalued fields into a single string, using three stars as a separator. This separator could be changed if *** ever appears in the data; you just need a string that will never show up in any of the possible values.
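The mvjoin/split round trip at the heart of those foreach steps can be sketched outside SPL. A minimal Python analogue, with made-up illustrative values:

```python
# Pack a multivalued field into a single string so it survives a
# single-valued aggregation later, then split it back out.
# "***" is just a separator assumed never to occur in the real values.
SEP = "***"

subcats = ["cars", "dolls", "games"]   # a multivalued field after stats list()
packed = SEP.join(subcats)             # like: eval f=mvjoin(f, "***")
restored = packed.split(SEP)           # like: eval f=split(f, "***")

print(packed)                # cars***dolls***games
print(restored == subcats)   # True
```

The only requirement on the separator is that it never appears in a value, which is exactly the caveat noted above.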

After all of the appendpipes, our result set is the raw results, followed by the statistics for the top 20 categories, followed by the top-3 statistics for each field per category. We no longer need the raw data, so we throw those records away with where. Since each set of statistics generated unique field names (sharing only category, of course), we can then gather the statistics for each category together into one row using stats.

NOW THE FINAL LINE! We again use where to keep only those categories that were in the top 20. The stats field has served its purpose, so we throw it away with the fields command. We put the categories in order with a sort, and finally expand the top 3 results for the other fields back into multivalued fields, again using foreach with an eval.

At a high level: we gather different statistics on all categories and add them to our result set, then remove the raw data and combine the statistics back together. Hopefully this search helps, and the ideas behind it help you solve other problems as well.
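To make the high-level plan concrete, here is what the search computes, sketched in plain Python. The sample events and their values are hypothetical; the field names match the search above, and `top_n` plays the role of `top limit=n`:

```python
from collections import Counter

# Hypothetical sample events; field names match the search above.
events = [
    {"category": "toys", "subcategory": "cars", "product_id": "p1", "referrer_url": "a.com"},
    {"category": "toys", "subcategory": "cars", "product_id": "p2", "referrer_url": "a.com"},
    {"category": "toys", "subcategory": "dolls", "product_id": "p1", "referrer_url": "b.com"},
    {"category": "books", "subcategory": "scifi", "product_id": "p9", "referrer_url": "a.com"},
]

def top_n(rows, field, n):
    """Analogue of `top limit=n <field>`: (value, count) pairs, most common first."""
    return Counter(r[field] for r in rows).most_common(n)

# Top 20 categories overall...
top_categories = top_n(events, "category", 20)

# ...then, per category, the top 3 values of each remaining field.
report = {}
for category, count in top_categories:
    rows = [r for r in events if r["category"] == category]
    report[category] = {
        "count": count,
        "subcat": top_n(rows, "subcategory", 3),
        "pid": top_n(rows, "product_id", 3),
        "url": top_n(rows, "referrer_url", 3),
    }
```

The SPL gymnastics above exist because this per-category grouping has to be expressed as set operations on one flat result stream rather than as a nested loop.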

Side Note... I also built a slightly different query where, after the first appendpipe, we use eventstats to determine whether each raw row belongs to one of the top 20 categories (i.e., whether the first appendpipe generated a record for that category). The idea was to send fewer results into the later appendpipe commands, but my Splunk at least seemed to perform better without it. For reference, that line was:

    | eventstats last(stats) as keep by category | where keep!="f" | fields - keep

This also causes some slight differences at the end of the search (for example, no longer needing to throw away results from categories outside the top 20), but those seem like very minor points. *shrug*


LATE ADDITION: The above answer was featured on the Smart AnSwerS blog series. Since then, while stumbling through a different Answers post, I discovered the as-yet-undocumented multireport command, which makes this search much cleaner (and possibly slightly faster?). So I would be remiss if I didn't share the new search:

<my_search> | fields + category subcategory product_id referrer_url | fields - _* 
| multireport [ top limit=20 showperc=f category ]
              [ top limit=3 showperc=f subcategory by category
                  | rename count as subcat_count subcategory as subcat ]
              [ top limit=3 showperc=f product_id by category 
                  | rename count as pid_count product_id as pid ]
              [ top limit=3 showperc=f referrer_url by category
                  | rename count as url_count referrer_url as url ]
| stats list(*) as * by category | where isnotnull(count) | sort - count

So much shorter!!!! In this search, the two fields commands on the first line are the equivalent of the table command in the original answer: again we're limiting the input to the reduce phase as much as possible, and possibly a bit faster, since fields is definitely a streaming command and I'm not sure table is. The multireport command then kicks off all of the top commands for us in parallel and returns the results of each, one after the other. (No more where condition to limit us to the original data set, and no more where to eliminate the raw results at the end.)

The last line is a change that, with minor adaptation, could simplify the original solution as well: we can use stats list(*) once, instead of in each subsearch, to gather all of the statistics for each category together. This works because the results are still in order coming out of appendpipe or multireport, and no field names conflict. (I think my brain was stuck on stats first(*) when I wrote the original search, which requires single strings to be passed around; also, the field-based statistics parts now emit 3N results instead of N, where N is the number of categories.) Furthermore, since the count field is only present on results that were part of the top 20 categories, we can use it as our where condition to eliminate the categories that were not. A similar sort of the categories and we're done!
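What that final `stats list(*) ... by category | where isnotnull(count)` does to the multireport output can be sketched in Python. The rows below are made-up illustrations of what each sub-report contributes:

```python
from collections import defaultdict

# Rows as they come out of multireport: each sub-report contributes its own
# columns, and all share the `category` field. (Illustrative values only.)
rows = [
    {"category": "toys", "count": 3},                            # from the top-20 report
    {"category": "toys", "subcat": "cars", "subcat_count": 2},   # from the subcategory report
    {"category": "toys", "subcat": "dolls", "subcat_count": 1},
    {"category": "misc", "subcat": "other", "subcat_count": 1},  # category not in the top 20
]

# `stats list(*) as * by category`: gather every field's values per category,
# preserving the order the rows arrived in.
merged = defaultdict(lambda: defaultdict(list))
for row in rows:
    for field, value in row.items():
        if field != "category":
            merged[row["category"]][field].append(value)

# `where isnotnull(count)`: only categories seen by the top-20 report have
# a count field, so this drops everything outside the top 20.
final = {cat: dict(fields) for cat, fields in merged.items() if "count" in fields}
```

Because only the top-20 report ever emits `count`, its presence is a free membership test, which is exactly why the trailing where clause works.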


Path Finder

Almost four years later, and this answer is still giving! Thank you so much!

(Also, Ice Bear is best bear.)


Influencer

#IceBearForPresident



Engager

This was exactly what I was looking for. Would have never figured it out myself (and still struggling to understand it), so thanks a lot acharlieh!

SplunkTrust

What I like most about this answer: it explains what happens and also provides an alternative solution! Well done, @acharlieh!