Splunk Search

Counting duplicate values

ataunk
Explorer

Situation : I have fields sessionId and personName. This session ID has many-to-may mapping with personName.

Need is : I want the count of personName associated with sessionId.

Query I am using : | table sessionId, personName, it gives following

sessionId personName
1. 1234 Name1
2. 1234 Name 2
3. 1234 Name1
4. 1234 Name3
5. 4321 Name1
6. 4321 Name3
7. 4321 Name3

What I want :
sessionId personName count
1. 1234 Name1 2
2. 1234 Name2 1
3. 1234 Name3 1
4. 4321 Name1 1
5. 4321 Name3 2

0 Karma

elliotproebstel
Champion

If I'm correctly understanding your goal, you want to be able to see the top x number of personName values per sessionId. Try this:
your base search | top 2 personName BY sessionId showperc=false

In this example, I used x=2, but you can replace the 2 there with another number and, depending on your use case, perhaps replace it with a token that is set elsewhere.

0 Karma

ataunk
Explorer

sorry, I am not getting the logic of you query. You can ref. above post for detailed explanation, I just posted

0 Karma

elliotproebstel
Champion

Did this work for you? Or did you find another solution?

0 Karma

elliotproebstel
Champion

As I understand it, you want to:
1. Collapse instances where the sessionId and personName are the same, appending to the end of such collapsed instance the count of times they were the same (transforming, for example, seven identical lines of 92S38LP31QG4E93Z0A0 NAME1 into 92S38LP31QG4E93Z0A0 NAME1 7)
2. Sort the resulting events for each sessionId in descending order by the count values
3. Set a threshold of events to retain for each sessionId - top 2, top 3, etc. So if you have a total of five sessionIds in the original data, then you will have a maximum of 5x events retained (possibly less than 5x if, for example, you set a threshold of 3 but you only have 2 personName values for a particular sessionId).

So the logic of the query I posted is quite simple: top does all three at once. It applies a count value of personName instances per sessionId, sorts the events in descending order within each sessionId, and then retains the top x items.

Have you tried applying my query to your data to see if it works? In my sample data, it achieved what I detailed here.

0 Karma

elliotproebstel
Champion

Maybe it would help to have a shared dataset that we could reference to track your goals against the behavior of the query. Here's a run-anywhere command that will generate >75 events with sessionId and personName values:

| makeresults count=5 | eval sessionId=1234, personName="Name1" | append [ | makeresults count=4 | eval sessionId=1234, personName="Name2" ] | append [ | makeresults count=6 | eval sessionId=1234, personName="Name3" ] | append [ | makeresults count=5 | eval sessionId=4321, personName="Name1" ] | append [ | makeresults count=4 | eval sessionId=4321, personName="Name2" ] | append [ | makeresults count=6 | eval sessionId=4321, personName="Name3" ] | append [ | makeresults count=9 | eval sessionId=5678, personName="Name1" ] | append [ | makeresults count=54 | eval sessionId=5678, personName="Name2" ]

If I append to that command the following:
| top 2 personName BY sessionId showperc=false

The result is a table like this:


sessionId personName count
1234 Name3 6
1234 Name1 5
4321 Name3 6
4321 Name1 5
5678 Name2 54
5678 Name1 9

Does that match your goal/requirements?

0 Karma

ataunk
Explorer

I also want to have a where clause -- where I want all the sessionId with highest number of occurrence. And this occurrence value should be configurable.

So, lets say, if I want the occurrence threshold is 3 I should all the below three
1. 1234 Name1 20
2. 1234 Name2 15
3. 1234 Name3 1
4. 4321 Name1 2
5. 4321 Name3 6
6. 4321 Name3 9

if I want the occurrence threshold is 2 I should all the below three
1. 1234 Name1 20
2. 1234 Name2 15
3. 4321 Name3 6
4. 4321 Name3 9

So, if there is sessionID (lets say 3243) with just 1 occurrence, then it should not come in the result.

0 Karma

somesoni2
Revered Legend

Try the updated answer.

0 Karma

somesoni2
Revered Legend

You can just replace | table sessionId, personName with | stats count by sessionId, personName in your search.

Update

User following for your sorting and filter requirement.

your base search

| stats count by sessionId, personName 
| where count>3
| sort -count

ataunk
Explorer

I also want to have a where clause -- where I want all the sessionId with highest number of occurrence. And this occurrence value should be configurable.

So, lets say, if I want the occurrence threshold is 3 I should all the below three
1. 1234 Name1 20
2. 1234 Name2 15
3. 1234 Name3 1
4. 4321 Name1 2
5. 4321 Name3 6
6. 4321 Name3 9

if I want the occurrence threshold is 2 I should all the below three
1. 1234 Name1 20
2. 1234 Name2 15
3. 4321 Name3 6
4. 4321 Name3 9

So, if there is sessionID (lets say 3243) with just 1 occurrence, then it should not come in the result.

0 Karma

ataunk
Explorer

No, this doesn't help -

It gives something like this , assuming the count is more that 10:
1. 1234 Name1 20
2. 4321 Name2 15

3. 1234 Name3 1
4. 4321 Name1 2
5. 4321 Name3 6
6. 4321 Name3 9

0 Karma

somesoni2
Revered Legend

Change the where clause according to the threshold you want to use.

0 Karma

ataunk
Explorer

this is not helping either. Basically I want the key (sessionId in our case) to be repetitive if there are different personName occurring multiple times

  1. 1234 Name1 20
  2. 1234 Name2 15
  3. 1234 Name3 1
  4. 4321 Name1 2
  5. 4321 Name3 6
  6. 4321 Name3 9
0 Karma

somesoni2
Revered Legend

The query your base search | stats count by sessionId, personName does do that. Could you post the exact query you're using, corresponding output (which I believe have issues) and corresponding expected output?

0 Karma

ataunk
Explorer

92S38LP31QG4E93Z0A0 NAME1
92S38LP31QG4E93Z0A0 NAME1
92S38LP31QG4E93Z0A0 NAME1
92S38LP31QG4E93Z0A0 NAME2
92S38LP31QG4E93Z0A0 NAME2
92S38LP31QG4E93Z0A0 NAME2
92S38LP31QG4E93Z0A0 NAME2
92S38LP31QG4E93Z0A0 NAME2
92S38LP31QG4E93Z0A0 NAME3
92S38LP31QG4E93Z0A0 NAME3
92S38LP31QG4E93Z0A0 NAME3
92S38LP31QG4E93Z0A0 NAME3
92S38LP31QG4E93Z0A0 NAME3

1826SSQ98518QAID99A NAME1
1826SSQ98518QAID99A NAME2
1826SSQ98518QAID99A NAME3
1826SSQ98518QAID99A NAME4
1826SSQ98518QAID99A NAME3
1826SSQ98518QAID99A NAME2
1826SSQ98518QAID99A NAME2
1826SSQ98518QAID99A NAME3

0LKJ68OP19A8865A92Q NAME1
0LKJ68OP19A8865A92Q NAME1
0LKJ68OP19A8865A92Q NAME2
0LKJ68OP19A8865A92Q NAME2
0LKJ68OP19A8865A92Q NAME4

92S38LP31QG4E93Z0A0 NAME1 3
92S38LP31QG4E93Z0A0 NAME2 5
92S38LP31QG4E93Z0A0 NAME3 5
1826SSQ98518QAID99A NAME1 1
1826SSQ98518QAID99A NAME2 3
1826SSQ98518QAID99A NAME3 3
1826SSQ98518QAID99A NAME4 1
0LKJ68OP19A8865A92Q NAME1 2
0LKJ68OP19A8865A92Q NAME2 2
0LKJ68OP19A8865A92Q NAME4 1

Now, if I say my occurence count is 2 then I should get

92S38LP31QG4E93Z0A0 NAME2 5
92S38LP31QG4E93Z0A0 NAME3 5

1826SSQ98518QAID99A NAME2 3
1826SSQ98518QAID99A NAME3 2

0LKJ68OP19A8865A92Q NAME1 2
0LKJ68OP19A8865A92Q NAME2 2

Instead what I am getting is --> coz it is just checking the count(end column). But I want to check the occurence count of the sessionID
92S38LP31QG4E93Z0A0 NAME1 3
92S38LP31QG4E93Z0A0 NAME2 5
92S38LP31QG4E93Z0A0 NAME3 5

1826SSQ98518QAID99A NAME2 3
1826SSQ98518QAID99A NAME3 3

0LKJ68OP19A8865A92Q NAME1 2
0LKJ68OP19A8865A92Q NAME2 2

And I am using exact same query you mentioned

| stats count by sessionId, personName
| where count>3
| sort -count

0 Karma

ataunk
Explorer

I also want to have a where clause -- where I want all the sessionId with highest number of occurrence. And this occurrence value should be configurable.

So, lets say, if I want the occurrence threshold is 3 I should all the below three
1. 1234 Name1 20
2. 1234 Name2 15
3. 1234 Name3 1
4. 4321 Name1 2
5. 4321 Name3 6
6. 4321 Name3 9

if I want the occurrence threshold is 2 I should all the below three
1. 1234 Name1 20
2. 1234 Name2 15
3. 4321 Name3 6
4. 4321 Name3 9

So, if there is sessionID (lets say 3243) with just 1 occurrence, then it should not come in the result.

0 Karma
Get Updates on the Splunk Community!

Enterprise Security Content Update (ESCU) | New Releases

In November, the Splunk Threat Research Team had one release of new security content via the Enterprise ...

Index This | Divide 100 by half. What do you get?

November 2024 Edition Hayyy Splunk Education Enthusiasts and the Eternally Curious!  We’re back with this ...

Stay Connected: Your Guide to December Tech Talks, Office Hours, and Webinars!

❄️ Celebrate the season with our December lineup of Community Office Hours, Tech Talks, and Webinars! ...