Our goal is to get the most recent event for specific fields using the "dedup" command in an indexer cluster.
We have read a similar case at the following link, but we are still confused about the usage of dedup:
https://answers.splunk.com/answers/323510/how-to-keep-all-most-recent-events-for-a-specific.html
The following is our case:
Event sample (index=myIndex)
conditions:
(1) 1 search head + 2 indexer instances (we use an indexer cluster)
(2) each event has one duplicated record (marked "duplicated event")
2019-12-04 12:00:00, machine=serverA, result=pass # duplicated event
2019-12-04 12:00:00, machine=serverA, result=pass
2019-12-04 12:00:00, machine=serverB, result=pass # duplicated event
2019-12-04 12:00:00, machine=serverB, result=pass
2019-12-03 12:00:00, machine=serverA, result=fail # duplicated event
2019-12-03 12:00:00, machine=serverA, result=fail
2019-12-03 12:00:00, machine=serverB, result=fail # duplicated event
2019-12-03 12:00:00, machine=serverB, result=fail
We want to get each server's most recent result per day, such as:
Target result
2019-12-04 12:00:00, machine=serverA, result=pass
2019-12-04 12:00:00, machine=serverB, result=pass
2019-12-03 12:00:00, machine=serverA, result=fail
2019-12-03 12:00:00, machine=serverB, result=fail
SPL query
index=myIndex
| dedup _time machine
Question:
Does the "dedup" command "always" return the most recent events based on the specified fields across multiple indexers?
In our case, if we apply the SPL query above under these conditions, can we always get the target result?
dedup "removes the events that contain an identical combination of values for the fields that you specify", so as long as all of the logs are being pulled in to your search head from all of the indexers (which, from your query results, it looks like they are), then yes, it will grab just one of them since you've specified those two fields.
"Events returned by dedup are based on search order. For historical searches, the most recent events are searched first." So without a sort, it will simply go in descending _time order, as that is the default for how Splunk reads events in historical (time-based) searches. You can also sort by _time or other fields with dedup:
https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/dedup#Optional_arguments
dedup _time machine sortby -_time
This doesn't make a ton of sense in this case because you're already specifying _time as a field to dedup on, but it's a thought for the future. That being said, you can also leverage the stats command, as this will give you more control over exactly what gets passed through, with less fuzziness about what Splunk chose to dedup. Example:
stats values(result) by _time, machine
will return the unique set of results for each _time/machine pairing. I prefer this because it's very clear exactly what you're doing, and you can also more easily see what is duplicated by switching values to list!
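For example, running both side by side against your sample data (field names taken from your events) makes the duplicates visible:

index=myIndex
| stats list(result) as all_results, values(result) as unique_results by _time, machine

list() keeps every occurrence, so a duplicated event shows up twice in all_results, while values() returns only the unique set for that _time/machine pairing.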
Hope this helps.
Thanks for your answer! It's really helpful.
We considered using stats before, but there were two reasons why we chose dedup rather than stats:
(1) performance aspect
If the dedup command only searches for the first matching event, does that mean the performance will be much better than stats?
(2) query complexity
If we have to deal with many fields, such as
2019-12-04 12:00:00, machine=A, field1=x1, field2=x2, field3=x3, ... field100=x100
2019-12-04 12:00:00, machine=A, field1=x1, field2=x2, field3=x3, ... field100=x100
2019-12-03 12:00:00, machine=A, field1=x1, field2=x2, field3=x3, ... field100=x100
2019-12-03 12:00:00, machine=A, field1=x1, field2=x2, field3=x3, ... field100=x100
we can just use a very simple query to get the most recent result per day per machine:
| dedup _time machine
2019-12-04 12:00:00, machine=A, field1=x1, field2=x2, field3=x3, ... field100=x100
2019-12-03 12:00:00, machine=A, field1=x1, field2=x2, field3=x3, ... field100=x100
on the other hand, it will be more complicated if we use stats to deal with each field, such as
|stats latest(field1), latest(field2)... by _time machine
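One possible simplification, if stats aggregation functions accept wildcards over all of our fields (the stats docs describe wildcarded field names, but we have not verified this on our version), would be:

|stats latest(*) as * by _time machine

though we are not sure whether that changes the performance picture compared to dedup.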
I'm not sure whether our consideration makes sense or not. Do you have any advice for this case?