Splunk Search

Is my search correct in finding the duplicate total count of messages based on a single field value?

jdepp
Path Finder

I am trying to display the single total count of all messages for the current day that have the same values for a single field. I had a previous search that did just that and displayed it against the total messages but there is a question on the counts, so I need to display the single duplicate count to overcome any uncertainty.

source="/msg.stats/tcp/999" | where message_type="Log" | eventstats count as total | eventstats count as duplicates by log_id|  stats count(eval(duplicates>1)) as no_duplicates values(total) as total | eval dupPercentage=no_duplicates/total*100

I thought this was working but a team member disputes the dupe stats as being too high. I assumed a simple search to display dupes would clear this up.

source="/msg.stats/tcp/999" | where message_type="Log" | stats count(eval(log_id>1)) as no_duplicates

I want to make sure that I have the searches correct so I can confidently accept the counts.

0 Karma

jdepp
Path Finder

Thanks everyone for your response; before I accept an answer, the other query that the panel has that disagrees with the non dupe count my query and most of your queries all produce:

source="/msg.stats/tcp/999" | where message_type="Log" | stats dc(log_id) as UniqueCount

This produces a count of for example 753119 while the non dupe count from the first query and some of the other queries below produces 398320. As you can see a big difference. will dig deeper thanks

0 Karma

DalJeanis
Legend

Remember, that dc is going to include one copy of each of your duplicate messages as well as the non-duplicated ones.

    source="/msg.stats/tcp/999" message_type="Log" 
 | eventstats count as duplicates by log_id
 | eval ImADuplicate = if(duplicates>1,1,0) 
 | eventstats dc(log_id) as UniqueDupCount by ImADuplicate
 | eval UniqueNonDupCount = nullif(UniqueDupCount*(1-ImADuplicate),0)
 | eval UniqueDupCount = nullif(UniqueDupCount*ImADuplicate,0)
 | stats count as TotCount, 
        first(UniqueDupCount) as UniqueDupCount, 
        first(UniqueNonDupCount) as UniqueNonDupCount, 
        sum(ImADuplicate) as DupCount, 
        dc(log_id) as UniqueCount
 | eval NonDupCount = UniqueCount - UniqueDupCount
 | eval ExtraCount = DupCount - UniqueDupCount
 | table TotCount UniqueCount NonDupCount UniquenonDupCount UniqueDupCount DupCount ExtraCount

Slight mod to get UniqueDupCount correct.

0 Karma

jdepp
Path Finder

wow; I am brain-dead right now, forgive me. Did not realize it would be so difficult get a simple duplicate total count and unique total count.

So if I understand correctly: using some simple numbers. total 1000 messages delivered for today.

100 of these messages have duplicates which would mean a minimum of 200 total duplicate messages if 2 occurrences of the message for a given log_id.

that means 800 of the total do not have dupes.

and finally the total unique messages would be 900 which is the unique + the count of those that have dupes. Correct?

dc(log_id) would give me the total unique count of events that include those that have dupes but only include that event once in the total count - 900

| eventstats count as duplicates by log_id
| eval ImADuplicate = if(duplicates>1,1,0) this should give the me the dupes?

Really appreciate it you taking the time to clarify this.

0 Karma

somesoni2
Revered Legend

Try like this

 source="/msg.stats/tcp/999" | where message_type="Log" 
| stats count by log_id | eval notDup=if(count=1,1,0)
| stats sum(count) as total sum(notDup) as notDup
| eval no_duplicates=total-notDup
| eval dupPercentage=no_duplicates*100/total
0 Karma

jdepp
Path Finder

Thanks appreciate your response.

0 Karma

DalJeanis
Legend

Do something like this to look at some examples of your duplicates, and see if there's another problem of some sort -

source="/msg.stats/tcp/999" message_type="Log" 
| join log_id  
   [search source="/msg.stats/tcp/999" message_type="Log"  
   | stats count as duplicates by log_id
   | search duplicates > 1
   | sort duplicates 
   | head 2
   | table log_id]

it should be pretty obvious if they are non-dups that somehow got the same log_id, or if they are real duplicates.

Finally, in case there's something subtle about your code that isn't working, here's the way I'd code that percentage calculation -

 source="/msg.stats/tcp/999" message_type="Log" 
| eventstats count as total 
| eventstats count as duplicates by log_id
| eval ImADuplicate = if(duplicates>1,1,0)
| stats count as EventCount, sum(ImADuplicate) as DupCount, first(total) as total 
| eval NonDupCount = EventCount - DupCount
| eval DupPercentage=DupCount/EventCount*100
| eval NonDupPercentage=NonDupCount/EventCount*100
| table total EventCount DupCount NonDupCount DupPercentage NonDupPercentage 
0 Karma

rjthibod
Champion

How about this query? Anything where count is greater than 1 has duplicates.

source="/msg.stats/tcp/999" message_type="Log" log_id > 1 | stats count by log_id | sort -count

0 Karma
Get Updates on the Splunk Community!

Developer Spotlight with Paul Stout

Welcome to our very first developer spotlight release series where we'll feature some awesome Splunk ...

State of Splunk Careers 2024: Maximizing Career Outcomes and the Continued Value of ...

For the past four years, Splunk has partnered with Enterprise Strategy Group to conduct a survey that gauges ...

Data-Driven Success: Splunk & Financial Services

Splunk streamlines the process of extracting insights from large volumes of data. In this fast-paced world, ...