Group CSV data results by two or more matching val...

jtsplunk · ‎04-11-2012

I'm indexing a CSV that appears like the following in its raw form:

Filenum,string
1,abc
2,defg
2,abc
3,xyz
3,abc
1,xyz
7,uiop
7,abc
4,defg
5,qazwsx
6,qazwsx
1,uiop
4,abc

etc..

In Splunk both "Filenum" and "String" are correctly being extracted as field names.

I'd like to spit out a table that automatically groups Filenums with two or more matching Strings.

For example, Filenum 1 & 3 can be grouped together since they both have Strings abc & xyz.

Sample desired output:

Filenum     1, 3     abc, xyz
Filenum     1, 7     abc, uiop
Filenum     2, 4     abc, defg

Any ideas?

Thanks!

woodcock · ‎07-05-2015

This isn't exactly what you asked for but, IMHO, is probably even better; try the contingency command like this:

... | contingency Filenum String

This builds a table like this:

Filenum abc xyz uiop    qazwsx  defg    TOTAL
1            1      1       1         0       0       3
7            1      0       1         0       0       2
4            1      0       0         0       1       2
3            1      1       0         0       0       2
6            0      0       0         1       0       1
5            0      0       0         1       0       1
2            1      0       0         0       1       2
TOTAL       5      2       2         2       2      13

acharlieh · ‎07-05-2015

While contingency is interesting, here's a first pass at a solution that gives what you ask for:

... | stats values(String) as Strings by Filenum
| eval lhs=trim(Filenum)."*".trim(mvjoin(Strings,";"))
| streamstats current=f values(lhs) as rhs | mvexpand rhs
| rex field=rhs "^(?<OtherFilenum>\d+)[*](?<OtherStrings>.+)$"
| makemv delim=";" OtherStrings | fields - lhs rhs
| mvexpand Strings | mvexpand OtherStrings | where Strings==OtherStrings
| stats values(Strings) as Strings by Filenum, OtherFilenum | where mvcount(Strings) >=2

What just happened here?

Well assuming that every row in the csv is a separate event with the case sensitive field names Filenum and String , the first line we use stats to gather every String for each Filenum together into a multivalued field called Strings. Now assuming that ; does not appear in any string, and Filenum are numeric values, we turn each row into a single string form of the File Number * and all strings separated by semicolons. (Obviously delimiters will have to change if the assumption is not correct. ) By doing this, we can use streamstats and mvexpand to get us all of the comparisons we need to perform (2 to 1; 3 to 2, and 1, 4 to 3, 2, and 1, etc.). The next two lines we use rex and makemv to parse the serialized form of the right side of the comparison back into the OtherFilenum and OtherStrings and since we no longer need the serialized forms we throw them away using fields. Using mvexpand again on both the Strings and OtherStrings we get every pairing of strings present in the left hand side and right hand side of the comparison. We then use where to keep only those pairs that match, and then use stats again to gather them up by pairs of FileNums. Finally we use where again to keep only those that have at least 2 different strings match.

There are likely optimizations that could be done here, but this should give you a decent start.

acharlieh · ‎07-05-2015

@woodcock Actually with the undocumented noop command in conjunction with multikv mocking CSV data is quite easy... I simply did this:

| noop | stats count | eval _raw="Filenum,String
 1,abc
 2,defg
 2,abc
 3,xyz
 3,abc
 1,xyz
 7,uiop
 7,abc
 4,defg
 5,qazwsx
 6,qazwsx
 1,uiop
 4,abc" | multikv | table Filenum, String

woodcock · ‎07-05-2015

Bravo! I sunk about an hour of thinking and playing around with this before I realized that he was essentially asking for contingency and decided that was good enough. Part of that was spent figuring out a good way to mock the data:

index=* | head 10 | streamstats current=f count AS serial | eval Filenum=serial | eval serialPlus1=serial+1 | eval serialPlus2=serial+2 | eval String= serial . "," . serialPlus1 . "," . serialPlus2 | table Filenum String | makemv delim="," String | mvexpand String | eval String = case(
Filenum=1 AND String=1, "abc", 
Filenum=1 AND String=2, "xyz", 
Filenum=1 AND String=3, "uiop", 
Filenum=2 AND String=2, "abc", 
Filenum=2 AND String=3, "defg", 
Filenum=3 AND String=3, "abc", 
Filenum=3 AND String=4, "xyz", 
Filenum=4 AND String=4, "abc", 
Filenum=4 AND String=5, "defg", 
Filenum=5 AND String=5, "qazwsx", 
Filenum=6 AND String=6, "qazwsx", 
Filenum=7 AND String=7, "abc", 
Filenum=7 AND String=8, "uiop")
| where isnotnull(String)

Group CSV data results by two or more matching values?

Index This | What is broken 80% of the time by February?

Unlock Faster Time-to-Value on Edge and Ingest Processor with New SPL2 Pipeline ...

Splunk MCP & Agentic AI: Machine Data Without Limits

Join the Conversation

Group CSV data results by two or more matching values?

Index This | What is broken 80% of the time by February?

Unlock Faster Time-to-Value on Edge and Ingest Processor with New SPL2 Pipeline ...

Splunk MCP & Agentic AI: Machine Data Without Limits