Our goal is to get the most recent event for specific fields using the "dedup" command in an indexer cluster.
We have read a similar case at the following link, but we are still confused about the usage of dedup:
https://answers.splunk.com/answers/323510/how-to-keep-all-most-recent-events-for-a-specific.html
The following is our case:
Event sample (index=myIndex)
conditions:
(1) 1 search head + 2 indexer instances (we use an indexer cluster)
(2) each event has one duplicated record (marked "duplicated event")
2019-12-04 12:00:00, machine=serverA, result=pass # duplicated event
2019-12-04 12:00:00, machine=serverA, result=pass
2019-12-04 12:00:00, machine=serverB, result=pass # duplicated event
2019-12-04 12:00:00, machine=serverB, result=pass
2019-12-03 12:00:00, machine=serverA, result=fail # duplicated event
2019-12-03 12:00:00, machine=serverA, result=fail
2019-12-03 12:00:00, machine=serverB, result=fail # duplicated event
2019-12-03 12:00:00, machine=serverB, result=fail
We want to get each server's most recent result per day, such as:
Target result
2019-12-04 12:00:00, machine=serverA, result=pass
2019-12-04 12:00:00, machine=serverB, result=pass
2019-12-03 12:00:00, machine=serverA, result=fail
2019-12-03 12:00:00, machine=serverB, result=fail
SPL query
index=myIndex
| dedup _time machine
Question:
Does the "dedup" command "always" return the most recent events based on the specified fields across multiple indexers?
In our case, if we apply the SPL query above under these conditions, can we always get the target result?
dedup "removes the events that contain an identical combination of values for the fields that you specify", so as long as all of the logs are being pulled in to your search head from all of the indexers (which, from your query results, it looks like they are), then yes, it will grab just one of them since you've specified those two fields.
"Events returned by dedup are based on search order. For historical searches, the most recent events are searched first." So without a sort, it will simply go in descending _time order, as that is the default for how Splunk reads events in historical (time-based) searches. You can also sort by _time or other fields with dedup:
https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/dedup#Optional_arguments
dedup _time machine sortby -_time
This doesn't make a ton of sense in this case because you're already specifying _time as a field to dedup on, but it's a thought for the future. That being said, you can also leverage the stats command, as this will give you more control over exactly what gets passed through, with less fuzziness about what Splunk chose to dedup. Example:
stats values(result) by _time, machine
will return the unique set of results for each _time/machine pairing. I prefer this because it's very clear exactly what you're doing, and you can also more easily see what is duplicated by switching values to list!
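For example, running both side by side against your sample data (field names taken from your events) makes the duplicates visible:

index=myIndex
| stats list(result) as all_results, values(result) as unique_results by _time, machine

list() keeps every occurrence, so a duplicated event shows up twice in all_results, while values() returns only the unique set for that _time/machine pairing.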
Hope this helps.
Thanks for your answer! It's really helpful.
We considered using stats before, but there were two reasons why we chose dedup rather than stats:
(1) performance aspect
If the dedup command only searches for the first matching event, does that mean the performance will be much better than stats?
(2) query complexity
If we have to deal with many fields, such as
2019-12-04 12:00:00, machine=A, field1=x1, field2=x2, field3=x3, ... field100=x100
2019-12-04 12:00:00, machine=A, field1=x1, field2=x2, field3=x3, ... field100=x100
2019-12-03 12:00:00, machine=A, field1=x1, field2=x2, field3=x3, ... field100=x100
2019-12-03 12:00:00, machine=A, field1=x1, field2=x2, field3=x3, ... field100=x100
we can just use a very simple query to get the most recent result per day per machine:
| dedup _time machine
2019-12-04 12:00:00, machine=A, field1=x1, field2=x2, field3=x3, ... field100=x100
2019-12-03 12:00:00, machine=A, field1=x1, field2=x2, field3=x3, ... field100=x100
on the other hand, it will be more complicated if we use stats to deal with each field, such as
|stats latest(field1), latest(field2)... by _time machine
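One possible simplification, if stats aggregation functions accept wildcards over all of our fields (the stats docs describe wildcarded field names, but we have not verified this on our version), would be:

|stats latest(*) as * by _time machine

though we are not sure whether that changes the performance picture compared to dedup.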
I'm not sure whether our consideration makes sense or not. Do you have any advice for this case?