Re: Is my search correct in finding the duplicate ...

jdepp · ‎01-18-2017

I am trying to display the single total count of all messages for the current day that have the same values for a single field. I had a previous search that did just that and displayed it against the total messages but there is a question on the counts, so I need to display the single duplicate count to overcome any uncertainty.

source="/msg.stats/tcp/999" | where message_type="Log" | eventstats count as total | eventstats count as duplicates by log_id|  stats count(eval(duplicates>1)) as no_duplicates values(total) as total | eval dupPercentage=no_duplicates/total*100

I thought this was working but a team member disputes the dupe stats as being too high. I assumed a simple search to display dupes would clear this up.

source="/msg.stats/tcp/999" | where message_type="Log" | stats count(eval(log_id>1)) as no_duplicates

I want to make sure that I have the searches correct so I can confidently accept the counts.

jdepp · ‎01-18-2017

Thanks everyone for your response; before I accept an answer, the other query that the panel has that disagrees with the non dupe count my query and most of your queries all produce:

source="/msg.stats/tcp/999" | where message_type="Log" | stats dc(log_id) as UniqueCount

This produces a count of for example 753119 while the non dupe count from the first query and some of the other queries below produces 398320. As you can see a big difference. will dig deeper thanks

DalJeanis · ‎01-18-2017

Remember, that dc is going to include one copy of each of your duplicate messages as well as the non-duplicated ones.

    source="/msg.stats/tcp/999" message_type="Log" 
 | eventstats count as duplicates by log_id
 | eval ImADuplicate = if(duplicates>1,1,0) 
 | eventstats dc(log_id) as UniqueDupCount by ImADuplicate
 | eval UniqueNonDupCount = nullif(UniqueDupCount*(1-ImADuplicate),0)
 | eval UniqueDupCount = nullif(UniqueDupCount*ImADuplicate,0)
 | stats count as TotCount, 
        first(UniqueDupCount) as UniqueDupCount, 
        first(UniqueNonDupCount) as UniqueNonDupCount, 
        sum(ImADuplicate) as DupCount, 
        dc(log_id) as UniqueCount
 | eval NonDupCount = UniqueCount - UniqueDupCount
 | eval ExtraCount = DupCount - UniqueDupCount
 | table TotCount UniqueCount NonDupCount UniquenonDupCount UniqueDupCount DupCount ExtraCount

Slight mod to get UniqueDupCount correct.

jdepp · ‎01-19-2017

wow; I am brain-dead right now, forgive me. Did not realize it would be so difficult get a simple duplicate total count and unique total count.

So if I understand correctly: using some simple numbers. total 1000 messages delivered for today.

100 of these messages have duplicates which would mean a minimum of 200 total duplicate messages if 2 occurrences of the message for a given log_id.

that means 800 of the total do not have dupes.

and finally the total unique messages would be 900 which is the unique + the count of those that have dupes. Correct?

dc(log_id) would give me the total unique count of events that include those that have dupes but only include that event once in the total count - 900

| eventstats count as duplicates by log_id
| eval ImADuplicate = if(duplicates>1,1,0) this should give the me the dupes?

Really appreciate it you taking the time to clarify this.

somesoni2 · ‎01-18-2017

Try like this

 source="/msg.stats/tcp/999" | where message_type="Log" 
| stats count by log_id | eval notDup=if(count=1,1,0)
| stats sum(count) as total sum(notDup) as notDup
| eval no_duplicates=total-notDup
| eval dupPercentage=no_duplicates*100/total

jdepp · ‎01-18-2017

Thanks appreciate your response.

DalJeanis · ‎01-18-2017

Do something like this to look at some examples of your duplicates, and see if there's another problem of some sort -

source="/msg.stats/tcp/999" message_type="Log" 
| join log_id  
   [search source="/msg.stats/tcp/999" message_type="Log"  
   | stats count as duplicates by log_id
   | search duplicates > 1
   | sort duplicates 
   | head 2
   | table log_id]

it should be pretty obvious if they are non-dups that somehow got the same log_id, or if they are real duplicates.

Finally, in case there's something subtle about your code that isn't working, here's the way I'd code that percentage calculation -

 source="/msg.stats/tcp/999" message_type="Log" 
| eventstats count as total 
| eventstats count as duplicates by log_id
| eval ImADuplicate = if(duplicates>1,1,0)
| stats count as EventCount, sum(ImADuplicate) as DupCount, first(total) as total 
| eval NonDupCount = EventCount - DupCount
| eval DupPercentage=DupCount/EventCount*100
| eval NonDupPercentage=NonDupCount/EventCount*100
| table total EventCount DupCount NonDupCount DupPercentage NonDupPercentage

rjthibod · ‎01-18-2017

How about this query? Anything where count is greater than 1 has duplicates.

source="/msg.stats/tcp/999" message_type="Log" log_id > 1 | stats count by log_id | sort -count

Is my search correct in finding the duplicate total count of messages based on a single field value?

Splunk Observability for AI

Splunk Enterprise Security 8.x: The Essential Upgrade for Threat Detection, ...

Splunk Observability as Code: From Zero to Dashboard

Are you a member of the Splunk Community?

Is my search correct in finding the duplicate total count of messages based on a single field value?

Splunk Observability for AI

Splunk Enterprise Security 8.x: The Essential Upgrade for Threat Detection, ...

Splunk Observability as Code: From Zero to Dashboard