Splunk Search

Why am I getting different count results using "chart count by field" versus "chart count(field) by field"?

Path Finder

Hello,

I have this raw line:

2016-02-25T15:48:09.762479+01:00 03ucas amavis[1369]: (01369-16) run_av (ClamAV-clamd-stream): p005 p003 p002 p001 INFECTED: Sanesecurity.Jurlbl.51aaae.UNOFFICIAL, Sanesecurity.Jurlbl.51aaae.UNOFFICIAL

If I run this search:

amavis run_av (\(ClamAV-clamd-stream\) OR \(ClamAV-clamd\)) INFECTED "Sanesecurity.Jurlbl.51aaae.UNOFFICIAL" | rex field=_raw "^(?:[^:\n]*:){6}(?:\s+(?P<virus>[A-z.\-\d,\s]+))" | makemv delim=", " virus | chart count by virus

I see:

virus=Sanesecurity.Jurlbl.51aaae.UNOFFICIAL
count=2

If I run this search:

amavis run_av (\(ClamAV-clamd-stream\) OR \(ClamAV-clamd\)) INFECTED "Sanesecurity.Jurlbl.51aaae.UNOFFICIAL" | rex field=_raw "^(?:[^:\n]*:){6}(?:\s+(?P<virus>[A-z.\-\d,\s]+))" | makemv delim=", " virus | chart count(virus) by virus

I see:

virus=Sanesecurity.Jurlbl.51aaae.UNOFFICIAL
count=4

Why do I see count=4 in the second search?

Thank you very much
Best Regards
Marco

1 Solution

SplunkTrust
SplunkTrust

The problem is the count(virus). A lot of people think in general that count(foo) is the same as count but it is not.

count(virus) by foo, will count the number of occurrences of the "virus" field, for each of the values of foo.

count by foo on the other hand will count the number of rows for each of the values of foo.

Since you have multivalue fields going on in your search, you can begin to see the trouble. count(virus) will quite happily count each of the multivalue values as its own "occurence".

count(virus) by virus is even more peculiar. This will consider each value of virus separately in the by virus part but then for the count(virus) part, it has to count up all occurrences that co-occur with that value, including its own value and other multivalue values.

In short, if virus is guaranteed to be single-value, count(virus) by virus does indeed do the same thing as count by virus, but as you see when multivalue fields enter the picture, it really isn't.

as a corollary, count(_raw) by foo is a little insane, because you're forcing Splunk to look at the _raw field, for really no reason.

Another note - it's sometimes intuitive to think count(foo) will count the distinct values of foo but it wont, so remember that distinct_count(foo) aka its shorthand dc(foo) is a different thing entirely.

View solution in original post

SplunkTrust
SplunkTrust

The problem is the count(virus). A lot of people think in general that count(foo) is the same as count but it is not.

count(virus) by foo, will count the number of occurrences of the "virus" field, for each of the values of foo.

count by foo on the other hand will count the number of rows for each of the values of foo.

Since you have multivalue fields going on in your search, you can begin to see the trouble. count(virus) will quite happily count each of the multivalue values as its own "occurence".

count(virus) by virus is even more peculiar. This will consider each value of virus separately in the by virus part but then for the count(virus) part, it has to count up all occurrences that co-occur with that value, including its own value and other multivalue values.

In short, if virus is guaranteed to be single-value, count(virus) by virus does indeed do the same thing as count by virus, but as you see when multivalue fields enter the picture, it really isn't.

as a corollary, count(_raw) by foo is a little insane, because you're forcing Splunk to look at the _raw field, for really no reason.

Another note - it's sometimes intuitive to think count(foo) will count the distinct values of foo but it wont, so remember that distinct_count(foo) aka its shorthand dc(foo) is a different thing entirely.

View solution in original post

Path Finder

Interesting!
Since _time is very accurate, I could use count(_time) in place of count(_raw)... Thank you.

0 Karma

SplunkTrust
SplunkTrust

hehe. No, don't do that either. If ultimately you want to count the rows, just do count. it's always more efficient than any count(foo), even _time.

Plus if you get in the habit of doing count(_time) it'll betray you later when you have some funky report with more than one transforming command, and for the later transforming commands there is no _time field.

State of Splunk Careers

Access the Splunk Careers Report to see real data that shows how Splunk mastery increases your value and job satisfaction.

Find out what your skills are worth!