Splunk Search

How to compare values between two different groupings (multivalue fields) to see what percentage of values are similar?

Path Finder

I'm looking to compare two groups of values from a data sample like this.

Group, User

Group1, User1
Group1, User2
Group2, User1
Group2, User3

My goal is to note how similar one group is to another. I know I can do a stats command of values(User) by Group to get a listing of each group with a field that has all the users included. I'd like to compare Group1 and Group2 to be able to say they have 75% similar values or something like that.

Not sure how to do it. My initial thought was to do something like appending the full contents of the Group I am trying to compare to every event and then performing a comparison between the User and that added field, but I don't know how to add that to every event. Doesn't seem like eventstats would work there.

0 Karma

Motivator

How about this? I assume you already have the Group and User field extracted? The first portion just generates some data, you probably want to start with | stats ...

| gentimes start=-30 | eval Group = "group" . (random() % 2)  | eval User = "user" . (random() % 9) | stats values(Group) dc(Group) AS group_count by User | eventstats count(eval(group_count==2)) AS matched count AS total |  eval percentage_overlap=round((matched/total)*100,2)

The stats command displays the Group names for each user, then counts up the number of distinct Groups each user belongs to. Then we use eventstats to count up the number of users who belong to both groups and the total number of users. Finally, calculate the percentage of overlap.

0 Karma

Path Finder

Thanks! This is pretty close to what I am looking to do. This creates by user, while my actual goal (which I may not have explained well enough) is to compare groups to see how similar they are. I actually have a list of hundreds of groups. I flipped the group and user fields in the search which causes the match not to work anymore. I'm assuming that is because the eventstats count(eval(group_count==2)) syntax is assuming 2 groups.

Any ideas how I might do something similar to what you've created where it's a bunch of groups?

0 Karma

Motivator

It does show you rows by User, but the percentage value is actually how close the two groups match (ie, what percentage of their users overlap). So if you have multiple groups, what would you expect the output to be? If I have 3 Groups, for example, would you expect a table of groups that shows the percentage match with every other group? Something like:

                  Group1  Group2 Group3 

Group1         100%     50%        20% 
Group2           50%     100%       35%
Group3           20%      35%        100%
0 Karma

Path Finder

That would definitely work. I was thinking of a way to do it in Perl and I would probably create an output like you just described.

0 Karma