I'm indexing a CSV that appears like the following in its raw form:
Filenum,string
1,abc
2,defg
2,abc
3,xyz
3,abc
1,xyz
7,uiop
7,abc
4,defg
5,qazwsx
6,qazwsx
1,uiop
4,abc
etc..
In Splunk both "Filenum" and "String" are correctly being extracted as field names.
I'd like to spit out a table that automatically groups Filenums with two or more matching Strings.
For example, Filenum 1 & 3 can be grouped together since they both have Strings abc & xyz.
Sample desired output:
Filenum 1, 3 abc, xyz
Filenum 1, 7 abc, uiop
Filenum 2, 4 abc, defg
Any ideas?
Thanks!
This isn't exactly what you asked for but, IMHO, is probably even better; try the contingency
command like this:
... | contingency Filenum String
This builds a table like this:
Filenum abc xyz uiop qazwsx defg TOTAL
1 1 1 1 0 0 3
7 1 0 1 0 0 2
4 1 0 0 0 1 2
3 1 1 0 0 0 2
6 0 0 0 1 0 1
5 0 0 0 1 0 1
2 1 0 0 0 1 2
TOTAL 5 2 2 2 2 13
While contingency
is interesting, here's a first pass at a solution that gives what you ask for:
... | stats values(String) as Strings by Filenum
| eval lhs=trim(Filenum)."*".trim(mvjoin(Strings,";"))
| streamstats current=f values(lhs) as rhs | mvexpand rhs
| rex field=rhs "^(?<OtherFilenum>\d+)[*](?<OtherStrings>.+)$"
| makemv delim=";" OtherStrings | fields - lhs rhs
| mvexpand Strings | mvexpand OtherStrings | where Strings==OtherStrings
| stats values(Strings) as Strings by Filenum, OtherFilenum | where mvcount(Strings) >=2
What just happened here?
Well assuming that every row in the csv is a separate event with the case sensitive field names Filenum
and String
, the first line we use stats to gather every String for each Filenum together into a multivalued field called Strings
. Now assuming that ;
does not appear in any string, and Filenum are numeric values, we turn each row into a single string form of the File Number * and all strings separated by semicolons. (Obviously delimiters will have to change if the assumption is not correct. ) By doing this, we can use streamstats and mvexpand to get us all of the comparisons we need to perform (2 to 1; 3 to 2, and 1, 4 to 3, 2, and 1, etc.). The next two lines we use rex and makemv to parse the serialized form of the right side of the comparison back into the OtherFilenum and OtherStrings and since we no longer need the serialized forms we throw them away using fields. Using mvexpand again on both the Strings
and OtherStrings
we get every pairing of strings present in the left hand side and right hand side of the comparison. We then use where to keep only those pairs that match, and then use stats again to gather them up by pairs of FileNums. Finally we use where again to keep only those that have at least 2 different strings match.
There are likely optimizations that could be done here, but this should give you a decent start.
@woodcock Actually with the undocumented noop command in conjunction with multikv mocking CSV data is quite easy... I simply did this:
| noop | stats count | eval _raw="Filenum,String
1,abc
2,defg
2,abc
3,xyz
3,abc
1,xyz
7,uiop
7,abc
4,defg
5,qazwsx
6,qazwsx
1,uiop
4,abc" | multikv | table Filenum, String
Bravo! I sunk about an hour of thinking and playing around with this before I realized that he was essentially asking for contingency
and decided that was good enough. Part of that was spent figuring out a good way to mock the data:
index=* | head 10 | streamstats current=f count AS serial | eval Filenum=serial | eval serialPlus1=serial+1 | eval serialPlus2=serial+2 | eval String= serial . "," . serialPlus1 . "," . serialPlus2 | table Filenum String | makemv delim="," String | mvexpand String | eval String = case(
Filenum=1 AND String=1, "abc",
Filenum=1 AND String=2, "xyz",
Filenum=1 AND String=3, "uiop",
Filenum=2 AND String=2, "abc",
Filenum=2 AND String=3, "defg",
Filenum=3 AND String=3, "abc",
Filenum=3 AND String=4, "xyz",
Filenum=4 AND String=4, "abc",
Filenum=4 AND String=5, "defg",
Filenum=5 AND String=5, "qazwsx",
Filenum=6 AND String=6, "qazwsx",
Filenum=7 AND String=7, "abc",
Filenum=7 AND String=8, "uiop")
| where isnotnull(String)