Splunk Search

How do I find count of duplicates along with total events count?

relango
Explorer

I have payload field in my events with duplicate values like

val1
val1
val2
val2
val3

How to do I search for the count of duplicate events (in above e.g 2 with val1,val2) vs count of total events (5)? I am able to find duplicates using search stats count by payload | where count > 1 but can't able to figure out beyond that...

0 Karma
1 Solution

mayurr98
Super Champion

hey

Try this run anywhere search

| makeresults | eval raw="val1" | append [| makeresults | eval raw="val1" ] | append [| makeresults | eval raw="val2"] | append [| makeresults | eval raw="val2" ] | append [| makeresults | eval raw="val3" ] | stats count(raw) as count by raw  | eventstats sum(count) as total_count | where count > 1

Ideally in your environment you should write

basesearch | stats count(payload) as count by payload| eventstats sum(count) as total_count | where count > 1

Let me know if this works!

View solution in original post

wrangler2x
Motivator

I was just looking at another Splunk Answer which was asking something slightly different, and what that person was looking for was to get a running total to the side of the count. I tried to answer that question but it had been closed, with a pointer to this one as being the same question (doesn't look the same to me, but heh...). Here is how you do that:

| makeresults | eval raw="val1 val1 val2 val2 val3 val4 val4 val4" | makemv delim=" " raw | mvexpand raw
| stats count by raw
| where count > 1
| sort - count
| accum count as totalCount

Of course the line beginning with makeresults would be replaced by your base search, as well ast the stats count by raw. You get the counts 3, 2, 2 in this example, and the totalCounts 3, 5, 7.

0 Karma

wrangler2x
Motivator
0 Karma

mayurr98
Super Champion

hey

Try this run anywhere search

| makeresults | eval raw="val1" | append [| makeresults | eval raw="val1" ] | append [| makeresults | eval raw="val2"] | append [| makeresults | eval raw="val2" ] | append [| makeresults | eval raw="val3" ] | stats count(raw) as count by raw  | eventstats sum(count) as total_count | where count > 1

Ideally in your environment you should write

basesearch | stats count(payload) as count by payload| eventstats sum(count) as total_count | where count > 1

Let me know if this works!

wrangler2x
Motivator

You can simplify building raw and get the same results, to wit:

| makeresults | eval raw="val1 val1 val2 val2 val3" | makemv delim=" " raw | mvexpand raw
0 Karma

relango
Explorer

Using above example: I added one more duplicate "val4" to above example and tried this query:

| makeresults | eval raw="val1" | append [| makeresults | eval raw="val1" ] | append [| makeresults | eval raw="val2"] | append [| makeresults | eval raw="val2" ] | append [| makeresults | eval raw="val3" ]| append [| makeresults | eval raw="val4" ] | append [| makeresults | eval raw="val4" ]   | stats count(raw) as count by raw  | eventstats sum(count) as total_count | where count > 1 |eventstats count as dup_count | table dup_count,total_count

It correctly returns dup_count as 3 and total_count 7 , but it is repeated three times for each duplicated value (val1,val2,val3) , but I just need one.

Piping it to | dedup total_count works for this eg: but not for the actual use case in which I am also trying to bucket stats by 1 day.

0 Karma

mayurr98
Super Champion

you are misunderstanding the concept | dedup total_count is just removing duplicates in a column as you have three time 7 now after this it is showing only one time as it removed duplicates whereas |eventstats count as dup_count works at each event/row which means total statistics if you see you have three rows that is why it is returning 3
I hope you understand what I am trying to say
If you are interested to find total payloads and duplicated payloads then try this

| makeresults | eval raw="val1" | append [| makeresults | eval raw="val1" ] | append [| makeresults | eval raw="val2"] | append [| makeresults | eval raw="val2" ] | append [| makeresults | eval raw="val3" ]| append [| makeresults | eval raw="val4" ] | append [| makeresults | eval raw="val4" ]   | stats count(raw) as count by raw   | eventstats sum(count) as total_count | where count > 1 | stats count as duplicated_payload_count first(total_count) as total_payload_count

as there are 7 entries so total_payload_count is 7 and 3 duplicated entries so duplicated_payload_count is three.

0 Karma

mayurr98
Super Champion

hey did you find above query helpful?

0 Karma

relango
Explorer

Thanks that works!. But I actually want to count the duplicates and also bucket it by time. Here what I tried:

basesearch | bin _time span=1d| stats count by payload,_time | eventstats sum(count) as total | where count > 1 | eventstats count as dup_count | table count, dup_count, total

This works, but one problem. The results are repeated for each value of payload and _time. How can I get just one result for each day.

Basically, I am trying to find the percentage of duplicate payloads per day by computing count of duplicate payloads per day and total number of payloads

0 Karma

mayurr98
Super Champion

hey I do not understand what do you want to achieve. You can just provide me sample input like you did for the above question and also provide what output you want.

0 Karma
Get Updates on the Splunk Community!

Index This | I am a number, but when you add ‘G’ to me, I go away. What number am I?

March 2024 Edition Hayyy Splunk Education Enthusiasts and the Eternally Curious!  We’re back with another ...

What’s New in Splunk App for PCI Compliance 5.3.1?

The Splunk App for PCI Compliance allows customers to extend the power of their existing Splunk solution with ...

Extending Observability Content to Splunk Cloud

Register to join us !   In this Extending Observability Content to Splunk Cloud Tech Talk, you'll see how to ...