Counting duplicate values

ataunk · ‎10-31-2017

Situation : I have fields sessionId and personName. This session ID has many-to-may mapping with personName.

Need is : I want the count of personName associated with sessionId.

Query I am using : | table sessionId, personName, it gives following

sessionId personName
1. 1234 Name1
2. 1234 Name 2
3. 1234 Name1
4. 1234 Name3
5. 4321 Name1
6. 4321 Name3
7. 4321 Name3

What I want :
sessionId personName count
1. 1234 Name1 2
2. 1234 Name2 1
3. 1234 Name3 1
4. 4321 Name1 1
5. 4321 Name3 2

elliotproebstel · ‎11-07-2017

If I'm correctly understanding your goal, you want to be able to see the top x number of personName values per sessionId. Try this:
your base search | top 2 personName BY sessionId showperc=false

In this example, I used x=2, but you can replace the 2 there with another number and, depending on your use case, perhaps replace it with a token that is set elsewhere.

ataunk · ‎11-07-2017

sorry, I am not getting the logic of you query. You can ref. above post for detailed explanation, I just posted

elliotproebstel · ‎11-15-2017

Did this work for you? Or did you find another solution?

elliotproebstel · ‎11-08-2017

As I understand it, you want to:
1. Collapse instances where the sessionId and personName are the same, appending to the end of such collapsed instance the count of times they were the same (transforming, for example, seven identical lines of 92S38LP31QG4E93Z0A0 NAME1 into 92S38LP31QG4E93Z0A0 NAME1 7)
2. Sort the resulting events for each sessionId in descending order by the count values
3. Set a threshold of events to retain for each sessionId - top 2, top 3, etc. So if you have a total of five sessionIds in the original data, then you will have a maximum of 5x events retained (possibly less than 5x if, for example, you set a threshold of 3 but you only have 2 personName values for a particular sessionId).

So the logic of the query I posted is quite simple: top does all three at once. It applies a count value of personName instances per sessionId, sorts the events in descending order within each sessionId, and then retains the top x items.

Have you tried applying my query to your data to see if it works? In my sample data, it achieved what I detailed here.

elliotproebstel · ‎11-08-2017

Maybe it would help to have a shared dataset that we could reference to track your goals against the behavior of the query. Here's a run-anywhere command that will generate >75 events with sessionId and personName values:

If I append to that command the following:
| top 2 personName BY sessionId showperc=false

The result is a table like this:



sessionId  personName   count

 1234        Name3        6

 1234        Name1        5

 4321        Name3        6

 4321        Name1        5

 5678        Name2        54

 5678        Name1        9

Does that match your goal/requirements?

ataunk · ‎11-02-2017

I also want to have a where clause -- where I want all the sessionId with highest number of occurrence. And this occurrence value should be configurable.

So, lets say, if I want the occurrence threshold is 3 I should all the below three
1. 1234 Name1 20
2. 1234 Name2 15
3. 1234 Name3 1
4. 4321 Name1 2
5. 4321 Name3 6
6. 4321 Name3 9

if I want the occurrence threshold is 2 I should all the below three
1. 1234 Name1 20
2. 1234 Name2 15
3. 4321 Name3 6
4. 4321 Name3 9

So, if there is sessionID (lets say 3243) with just 1 occurrence, then it should not come in the result.

somesoni2 · ‎11-02-2017

Try the updated answer.

somesoni2 · ‎10-31-2017

You can just replace | table sessionId, personName with | stats count by sessionId, personName in your search.

Update

User following for your sorting and filter requirement.

your base search

| stats count by sessionId, personName 
| where count>3
| sort -count

ataunk · ‎11-02-2017

I also want to have a where clause -- where I want all the sessionId with highest number of occurrence. And this occurrence value should be configurable.

So, lets say, if I want the occurrence threshold is 3 I should all the below three
1. 1234 Name1 20
2. 1234 Name2 15
3. 1234 Name3 1
4. 4321 Name1 2
5. 4321 Name3 6
6. 4321 Name3 9

if I want the occurrence threshold is 2 I should all the below three
1. 1234 Name1 20
2. 1234 Name2 15
3. 4321 Name3 6
4. 4321 Name3 9

So, if there is sessionID (lets say 3243) with just 1 occurrence, then it should not come in the result.

ataunk · ‎11-02-2017

No, this doesn't help -

It gives something like this , assuming the count is more that 10:
1. 1234 Name1 20
2. 4321 Name2 15
3. 1234 Name3 1
4. 4321 Name1 2
5. 4321 Name3 6
6. 4321 Name3 9

somesoni2 · ‎11-02-2017

Change the where clause according to the threshold you want to use.

ataunk · ‎11-07-2017

this is not helping either. Basically I want the key (sessionId in our case) to be repetitive if there are different personName occurring multiple times

1234 Name1 20
1234 Name2 15
1234 Name3 1
4321 Name1 2
4321 Name3 6
4321 Name3 9

somesoni2 · ‎11-07-2017

The query your base search | stats count by sessionId, personName does do that. Could you post the exact query you're using, corresponding output (which I believe have issues) and corresponding expected output?

ataunk · ‎11-07-2017

92S38LP31QG4E93Z0A0 NAME1
92S38LP31QG4E93Z0A0 NAME1
92S38LP31QG4E93Z0A0 NAME1
92S38LP31QG4E93Z0A0 NAME2
92S38LP31QG4E93Z0A0 NAME2
92S38LP31QG4E93Z0A0 NAME2
92S38LP31QG4E93Z0A0 NAME2
92S38LP31QG4E93Z0A0 NAME2
92S38LP31QG4E93Z0A0 NAME3
92S38LP31QG4E93Z0A0 NAME3
92S38LP31QG4E93Z0A0 NAME3
92S38LP31QG4E93Z0A0 NAME3
92S38LP31QG4E93Z0A0 NAME3

1826SSQ98518QAID99A NAME1
1826SSQ98518QAID99A NAME2
1826SSQ98518QAID99A NAME3
1826SSQ98518QAID99A NAME4
1826SSQ98518QAID99A NAME3
1826SSQ98518QAID99A NAME2
1826SSQ98518QAID99A NAME2
1826SSQ98518QAID99A NAME3

0LKJ68OP19A8865A92Q NAME1
0LKJ68OP19A8865A92Q NAME1
0LKJ68OP19A8865A92Q NAME2
0LKJ68OP19A8865A92Q NAME2
0LKJ68OP19A8865A92Q NAME4

92S38LP31QG4E93Z0A0 NAME1 3
92S38LP31QG4E93Z0A0 NAME2 5
92S38LP31QG4E93Z0A0 NAME3 5
1826SSQ98518QAID99A NAME1 1
1826SSQ98518QAID99A NAME2 3
1826SSQ98518QAID99A NAME3 3
1826SSQ98518QAID99A NAME4 1
0LKJ68OP19A8865A92Q NAME1 2
0LKJ68OP19A8865A92Q NAME2 2
0LKJ68OP19A8865A92Q NAME4 1

Now, if I say my occurence count is 2 then I should get

92S38LP31QG4E93Z0A0 NAME2 5
92S38LP31QG4E93Z0A0 NAME3 5

1826SSQ98518QAID99A NAME2 3
1826SSQ98518QAID99A NAME3 2

0LKJ68OP19A8865A92Q NAME1 2
0LKJ68OP19A8865A92Q NAME2 2

Instead what I am getting is --> coz it is just checking the count(end column). But I want to check the occurence count of the sessionID
92S38LP31QG4E93Z0A0 NAME1 3
92S38LP31QG4E93Z0A0 NAME2 5
92S38LP31QG4E93Z0A0 NAME3 5

1826SSQ98518QAID99A NAME2 3
1826SSQ98518QAID99A NAME3 3

0LKJ68OP19A8865A92Q NAME1 2
0LKJ68OP19A8865A92Q NAME2 2

And I am using exact same query you mentioned

| stats count by sessionId, personName
| where count>3
| sort -count

ataunk · ‎11-02-2017

I also want to have a where clause -- where I want all the sessionId with highest number of occurrence. And this occurrence value should be configurable.

So, lets say, if I want the occurrence threshold is 3 I should all the below three
1. 1234 Name1 20
2. 1234 Name2 15
3. 1234 Name3 1
4. 4321 Name1 2
5. 4321 Name3 6
6. 4321 Name3 9

if I want the occurrence threshold is 2 I should all the below three
1. 1234 Name1 20
2. 1234 Name2 15
3. 4321 Name3 6
4. 4321 Name3 9

So, if there is sessionID (lets say 3243) with just 1 occurrence, then it should not come in the result.

Counting duplicate values

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

ATTENTION: We’re Moving! (AGAIN!)

Deep Dive: Optimizing Telemetry Pipelines in Splunk Observability Cloud

Announcing Modern Navigation: A New Era of Splunk User Experience

Join the Conversation

Counting duplicate values

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

ATTENTION: We’re Moving! (AGAIN!)

Deep Dive: Optimizing Telemetry Pipelines in Splunk Observability Cloud

Announcing Modern Navigation: A New Era of Splunk User Experience