topic Re: How do I deduplicate events with such conditions? in Splunk Dev

How do I deduplicate events with such conditions?

szabados — Wed, 23 Aug 2017 20:12:07 GMT

I have multiple custom data sources, mainly scripts, which send events to Splunk on a schedule.
I can distinguish each execution of these sources by either a timestamp or a custom ID, which is incremented with every execution and captured in every event. The events always have a proper host field, which, together with the ID just mentioned, contributes to the "unique key" of an event. The hosts also carry custom attributes; this is the third part of what could be used as a unique key. An attribute is present in the events as long as it applies to a given host, and is no longer present once it stops applying.

An example of what I mean (every line is a separate event):

hostID=host1, attributeID=attribute1, customid=customid1
hostID=host1, attributeID=attribute2, customid=customid1
hostID=host2, attributeID=attribute1, customid=customid2
hostID=host1, attributeID=attribute1, customid=customid2

(Because of the _time field, these would appear in Splunk in reverse order obviously)

I want to deduplicate these events so that I always keep only the data from the very last execution of a script. From the above example, I want to have only:

host2, attribute1, customid2
host1, attribute1, customid2

If I were to use

| dedup hostID, attributeID, customid

it would yield:
- host1, attribute2, customid1
- host2, attribute1, customid2
- host1, attribute1, customid2

The solution my team came up with is:

<base search> | eventstats max(customid) as max_customid by hostID | search customid=max_customid

This pretty much does the job, but I feel it is not really efficient. What would be the right approach to do this?

===EDIT

A given host has multiple events (with multiple attributes) from the same execution of the script.
A more detailed example: let's say I have these events:

hostID=host1, attributeID=attribute1, customid=customid1
hostID=host1, attributeID=attribute2, customid=customid1
hostID=host2, attributeID=attribute1, customid=customid2
hostID=host1, attributeID=attribute1, customid=customid2
hostID=host1, attributeID=attribute3, customid=customid2
hostID=host1, attributeID=attribute4, customid=customid2
hostID=host2, attributeID=attribute3, customid=customid2

I want to keep the below events:

hostID=host2, attributeID=attribute1, customid=customid2
hostID=host1, attributeID=attribute1, customid=customid2
hostID=host1, attributeID=attribute3, customid=customid2
hostID=host1, attributeID=attribute4, customid=customid2
hostID=host2, attributeID=attribute3, customid=customid2

This is the reason I can't use stats first(): it would give me only one result per host.
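
For readers who want to check the logic outside Splunk, here is a rough Python model of the eventstats approach from the post above. It is only a sketch: the function name is made up for illustration, and customid is reduced to a plain integer (an assumption) so that the maximum is compared numerically.

```python
# Events from the detailed example; customid is stored as an integer
# (assumption) so that max() compares numerically.
events = [
    {"hostID": "host1", "attributeID": "attribute1", "customid": 1},
    {"hostID": "host1", "attributeID": "attribute2", "customid": 1},
    {"hostID": "host2", "attributeID": "attribute1", "customid": 2},
    {"hostID": "host1", "attributeID": "attribute1", "customid": 2},
    {"hostID": "host1", "attributeID": "attribute3", "customid": 2},
    {"hostID": "host1", "attributeID": "attribute4", "customid": 2},
    {"hostID": "host2", "attributeID": "attribute3", "customid": 2},
]


def keep_latest_execution(events):
    """Hypothetical model of:
    eventstats max(customid) as max_customid by hostID | search customid=max_customid
    """
    # Pass 1: highest customid seen for each host.
    max_by_host = {}
    for e in events:
        host = e["hostID"]
        max_by_host[host] = max(max_by_host.get(host, e["customid"]), e["customid"])
    # Pass 2: keep only the events that belong to that execution.
    return [e for e in events if e["customid"] == max_by_host[e["hostID"]]]


kept = keep_latest_execution(events)
for e in kept:
    print(e["hostID"], e["attributeID"], e["customid"])
```

In this model the two customid 1 events are dropped and the five customid 2 events are kept, which matches the list of events to keep above.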

Re: How do I deduplicate events with such conditions?

s2_splunk — Wed, 23 Aug 2017 20:49:16 GMT

Have you tried the "first" function with the stats command: <base search> | eval myKey=attributeID.customid | stats first(myKey) by hostID

Re: How do I deduplicate events with such conditions?

somesoni2 — Wed, 23 Aug 2017 20:53:16 GMT

How about you just do dedup on host??

Re: How do I deduplicate events with such conditions?

szabados — Thu, 24 Aug 2017 07:29:40 GMT

Unfortunately that is not what I need; please see my update on the original post above.

Re: How do I deduplicate events with such conditions?

s2_splunk — Thu, 24 Aug 2017 08:30:04 GMT

<base search> | eval myKey=hostID.attributeID.customid | dedup myKey

Should do what you want. Dedup keeps the youngest event that matches the combined key.
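
A rough Python model of dedup may help when comparing the suggestions in this thread. It assumes dedup keeps the first event it sees for each distinct key, in result order (newest first here); the dedup helper and the event list are illustrative, not Splunk code.

```python
# Detailed example from the original post, listed newest first
# (the order a default search would return).
events = [
    {"hostID": "host2", "attributeID": "attribute3", "customid": "customid2"},
    {"hostID": "host1", "attributeID": "attribute4", "customid": "customid2"},
    {"hostID": "host1", "attributeID": "attribute3", "customid": "customid2"},
    {"hostID": "host1", "attributeID": "attribute1", "customid": "customid2"},
    {"hostID": "host2", "attributeID": "attribute1", "customid": "customid2"},
    {"hostID": "host1", "attributeID": "attribute2", "customid": "customid1"},
    {"hostID": "host1", "attributeID": "attribute1", "customid": "customid1"},
]


def dedup(events, fields):
    """Keep the first event seen for each distinct combination of `fields`."""
    seen = set()
    kept = []
    for e in events:
        key = tuple(e[f] for f in fields)
        if key not in seen:
            seen.add(key)
            kept.append(e)
    return kept


by_combined_key = dedup(events, ["hostID", "attributeID", "customid"])
by_host = dedup(events, ["hostID"])

print(len(by_combined_key))  # number of events kept with the combined key
print(len(by_host))          # number of events kept with hostID only
```

On this data every event has a distinct combined key, so the model keeps all seven events, including the two from customid1. Deduplicating on hostID alone keeps a single event per host.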

Re: How do I deduplicate events with such conditions?

woodcock — Fri, 25 Aug 2017 15:21:29 GMT

Let's baseline. These stats function pairs are similar: first/last, earliest/latest, min/max. The last pair is, I think, obvious, but the first pair is not the same as the second pair, which is what many people assume at first. If your events have not been re-sorted, they should (and this is a big "should", because sometimes Splunk fails to do this and doesn't always generate a warning) come back to you sorted newest to oldest, with the newest on top. In that case, first does the same thing as latest. Let that sink in: first DOES NOT do the same thing as earliest; it does the OPPOSITE. That is because first simply starts at the top of your results (which by default should be the "latest" event) and grabs the "first" value that it sees.
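
The difference can be sketched in a few lines of Python. This is only a model of the behavior described above, with made-up events and helper names: first/last depend on the order of the results, while earliest/latest depend on _time.

```python
# Three illustrative events, newest first, as a default search would return them.
events = [
    {"_time": 300, "value": "c"},
    {"_time": 200, "value": "b"},
    {"_time": 100, "value": "a"},
]


def first(evts, field):
    # Position-based: whatever is on top of the results.
    return evts[0][field]


def last(evts, field):
    # Position-based: whatever is at the bottom of the results.
    return evts[-1][field]


def latest(evts, field):
    # Time-based: the event with the largest _time.
    return max(evts, key=lambda e: e["_time"])[field]


def earliest(evts, field):
    # Time-based: the event with the smallest _time.
    return min(evts, key=lambda e: e["_time"])[field]


print(first(events, "value"), latest(events, "value"))   # c c
print(last(events, "value"), earliest(events, "value"))  # a a

# After re-sorting oldest first, first changes but latest does not.
resorted = sorted(events, key=lambda e: e["_time"])
print(first(resorted, "value"), latest(resorted, "value"))  # a c
```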

OK, so for your case, simply sort your events the way that you desire (you can have multiple layers of sort by using more than 1 field argument) and then use first or dedup.

Pro tip: be sure that you use sort 0, not just sort; plain sort returns only the first 10,000 results by default.
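
As a rough illustration of the sort-then-dedup idea, here is a Python model using the detailed example from the original post. The sort keys (customid descending, then attributeID) and the integer customid are assumptions chosen for the sketch; dedup is modeled as keeping the first event seen per key.

```python
# Detailed example from the original post, in arbitrary order;
# customid is stored as an integer (assumption) for numeric sorting.
events = [
    {"hostID": "host1", "attributeID": "attribute1", "customid": 1},
    {"hostID": "host1", "attributeID": "attribute2", "customid": 1},
    {"hostID": "host2", "attributeID": "attribute1", "customid": 2},
    {"hostID": "host1", "attributeID": "attribute1", "customid": 2},
    {"hostID": "host1", "attributeID": "attribute3", "customid": 2},
    {"hostID": "host1", "attributeID": "attribute4", "customid": 2},
    {"hostID": "host2", "attributeID": "attribute3", "customid": 2},
]

# Two layers of sort: newest execution first, then attributeID ascending.
ordered = sorted(events, key=lambda e: (-e["customid"], e["attributeID"]))

# Model of dedup on hostID: keep the first event seen for each host.
seen = set()
kept = []
for e in ordered:
    if e["hostID"] not in seen:
        seen.add(e["hostID"])
        kept.append(e)

for e in kept:
    print(e["hostID"], e["attributeID"], e["customid"])
```

In this model each host ends up with exactly one event, taken from its newest execution, so the choice of sort fields and dedup fields decides which event survives.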