Archive

Calculating percentiles and including "missing" data

Path Finder

I have a set of events that look something like the following:


Machine: A, File: Foo, SizeMB: 10
Machine: A, File: Bar, SizeMB: 100
Machine: B, File: Foo, SizeMB: 10
Machine: C, File: Foo, SizeMB: 10
Machine: D, File: Foo, SizeMB: 10

Not all machines contain all files (Machine A is the only one with File "Bar" in this example).

I'd like to calculate the percentile size for each file, but I want the file to be treated as size 0 when it's not present. Given the above example, if I tried to calculate the 50th percentile size for each file, the calculation would get the 50th percentile of (100), since we only have the one event. What I really want is to get the 50th percentile of (100, 0, 0, 0), but I've yet to come up with a reasonable way to do that. Anyone have any ideas?

Tags (1)
0 Karma

Path Finder

Not that pretty but it works:

index="test" 
| stats count by file 
| sort -count
| map search="search index=test file!=$file$ 
| eval file=\"$file$\" 
| eval SizeMB=0 
| append [search index=test file=$file$ 
| stats count by file,Machine,SizeMB]" 
| table file,SizeMB,Machine
| stats p50(SizeMB) BY file

Replace "test" with the name of your index. Notice that my "file" is lowercase.

alt text

Path Finder

Thanks! Yeah, "map" does work here, although it's really slow, and doesn't appear to work for more than a few hundred files (well, it sat at "Finalizing job" for a few hours before I killed it). I've upvoted but I won't mark the question as solved yet, per @niketnilay's advice in the comments on his answer.

0 Karma

Path Finder

Just to get an idea, roughly how many events, distinct files and computers are you looking to analyze in a typical job? If there are so many files that the iterative approach used by map/append is not performant, maybe we can look into a custom Python script as a lookup that would receive an array with the sizes of the files that are present and the number of distinct computers as input, pad the array with 0's accordingly and calculate the percentile. I'll run some tests...

0 Karma

SplunkTrust
SplunkTrust

Can you please run the following run anywhere search which dummies your data and confirm whether the output matches your expected output (without percentile, I just want to be sure that dummy rows with 0 count are added as per your expectations)?

 |  makeresults
 |  eval data="Machine: A, File: Foo, SizeMB: 10;Machine: A, File: Bar, SizeMB: 100;Machine: B, File: Foo, SizeMB: 10;Machine: C, File: Foo, SizeMB: 10;Machine: D, File: Foo, SizeMB: 10"
 |  makemv data delim=";"
 |  mvexpand data
 |  rename data as _raw
 |  rex "Machine:\s?(?<machine>\w+),\s?File:\s?(?<file>\w+),\s?SizeMB:\s?(?<size>\d+)" max_match=0
 |  dedup file
 |  table file
 |  map search="| makeresults
 |  eval data=\"Machine: A, File: Foo, SizeMB: 10;Machine: A, File: Bar, SizeMB: 100;Machine: B, File: Foo, SizeMB: 10;Machine: C, File: Foo, SizeMB: 10;Machine: D, File: Foo, SizeMB: 10\"
 |  makemv data delim=\";\"
 |  mvexpand data
 |  rename data as _raw
 |  rex \"Machine:\s?(?<machine>\w+),\s?File:\s?(?<file>\w+),\s?SizeMB:\s?(?<size>\d+)\" max_match=0
 |  table file machine size
 |  appendpipe[| makeresults
 |  eval file=\"$file$\", machine=machine, size=0
 |  fields - _time]"
 |  dedup file machine
____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

Path Finder

Yes, this runanywhere search does add the dummy events.

0 Karma

SplunkTrust
SplunkTrust

Can you try the following? Add your current query's base search in the two places <YourBaseSearch>. PS: If you have and double quotes in your base search, you would need to escape the same in in map command using backslash. (PS: I have done the same for rex command inside map with double quotes.) (http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Map)

  <YourBaseSearch>
  |  rex "Machine:\s?(?<machine>\w+),\s?File:\s?(?<file>\w+),\s?SizeMB:\s?(?<size>\d+)" max_match=0
  |  dedup file
  |  table file
  |  map search="<YourBaseSearch>
  |  rex \"Machine:\s?(?<machine>\w+),\s?File:\s?(?<file>\w+),\s?SizeMB:\s?(?<size>\d+)\" max_match=0
  |  table file machine size
  |  appendpipe[| makeresults
  |  eval file=\"$file$\", machine=machine, size=0
  |  fields - _time]"
  |  dedup file machine
____________________________________________
| makeresults | eval message= "Happy Splunking!!!"
0 Karma

Path Finder

This did work! It takes a really long time to run (10+ minutes) with no intermediate results, because of the embedded search, but it works!

Thanks for the help! Map is a really powerful command I haven't explored in the past.

If I discover any faster way of doing this, I'll definitely update.

0 Karma

SplunkTrust
SplunkTrust

@doweaver, if somehow lookup file can be updated you will not need map command, but you have mentioned in one of your comments that you can not know file names beforehand.

The map query as well as appendpipe with dedup command are expensive. Let me convert my previous comment to answer for others to assist. Please do not accept the answer until you find a better one. (But you can definitely up vote if it helped 🙂 ... wink.. wink!!!

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

Path Finder

Definitely! Thanks for all your help today 🙂

0 Karma

Path Finder

What is the expected output for the two file types?

0 Karma

Path Finder

The desired outcome for

stats p50(SizeMB) BY File

Would be:


Foo: 10
Bar: 0

0 Karma

SplunkTrust
SplunkTrust

@doweaver, for your question ...but I want the file to be treated as size 0 when it's not present, do you have sample? In your event all Machines have File present.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

Path Finder

To clarify - a machine can have many files. In this example, Machine A has files "Foo" and "Bar", and the rest only have file "Foo".

0 Karma

SplunkTrust
SplunkTrust

So for the above sample event do you need to add Machine: B, File: Bar, SizeMB: 0 Machine: C, File: Bar, SizeMB: 0 Machine: D, File: Bar, SizeMB: 0, before performing your calculation?

Also is the sample a single event with multiple values or is it multiple event each with single Machine information? Do you need to perform calculation at single event level with multiple values?

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

Path Finder

Adding dummy events like that would be one way to solve it, yes. Not sure if there are others.

Each event is from a single machine, and contains information on a single file. As the events are currently constructed, there aren't any mvfields.

0 Karma

Legend

What you will need is two lookup files. One with list of all possible machines (only one column called machine) and second with a list of all possible files (one column called file). Assuming you events look exactly as you have provided in your question, try something like this...

| your base search that returns "Machine: A, File: Foo, SizeMB: 10"  in each event 
| rex field=x "Machine:\s?(?<machine>\w+),\s?File:\s?(?<file>\w+),\s?SizeMB:\s?(?<size>\d+)" 
| append 
    [| inputlookup machinelist.csv 
    | table machine 
    | eval file= 
        [| inputlookup filelist.csv 
        | mvcombine file delim=","
        | return $file] 
    | makemv file delim="," 
    | mvexpand file]
| stats max(size) as size by machine file
| stats perc50(size) by file

Path Finder

Unfortunately, it's impossible for me to statically define my list of possible machines in a lookup file - there's not a predetermined set. The list of files is statically defined... let me play around with your suggestion and see if I can get it to work with the single lookup. Thanks!

0 Karma