I know that the "dedup" command returns the most recent values in time. However, I'm currently in a situation where I want to use dedup to only keep the oldest events from my data (example below). I found the following thread which is identical to my question, but the proposed solution (sorting on +_time) does not seem to work for me.
What I specifically have are a bunch of client requests to a web server. Each event has an associated req_time and a session_id; many transactions can share the same session_id. What I want to do is call '...|dedup session_id' and have only the OLDEST transaction from each individual session_id be returned, rather than the NEWEST.
Any suggestions on how to accomplish this?
I think you will find the sortby parameter to do this for you.
YourSearch | dedup session_id sortby +_time
Check out the docs for more ways you can tweak dedup:
Thanks for the reply, David.
I mentioned that I tried this solution in my earlier question. For some reason, it did not work yesterday and only the oldest events were removed. However, it is working this morning to my pleasant surprise.
Any idea as to why that happened?
EDIT: Answered my own question, but I'm still mystified by it. The query which successfully returned the oldest events included some concurrency information that I had been playing around with.
... | eval timeout=1599 | ... | concurrency duration=timeout | dedup session_id
The above works. I have no idea why.
I just tried that and can confirm what you found. If you put a concurrency before the dedup, it returns the same results as if you had done a sortby +_time. You should be able to override this with a sortby -_time, but that search failed for me ("job ... is a zombie and is no longer with us"). This appears to be a bug in which concurrency does some sort of work on _time and breaks dedup.
Fortunately, if you need to grab the newest events after running a concurrency (or either way want to wrest control of your search's fate out from the hands of concurrency), you can work around this by creating another time field. I was able to do:
MySearch | eval MyTime = _time | concurrency duration=duration output=concurrentevents | dedup MyField sortby -MyTime
Without the same issue. Likewise, +MyTime works.
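If you would rather not depend on how dedup picks its order at all, another option is to make the order explicit before dedup runs. This is only a sketch, not tested against the data in this thread. It relies on dedup keeping the first event it encounters for each session_id, and first_time is a made-up name for the output field:

```
MySearch | sort 0 _time | dedup session_id
MySearch | stats earliest(_time) as first_time by session_id
```

The first form sorts ascending on _time (the 0 lifts sort's default result limit), so the oldest event per session_id is the one dedup keeps. The second form returns only the earliest timestamp per session_id rather than the whole event, which may or may not be enough for your case.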
Does that get you where you need to be?
I am in a similar situation: I need to retrieve the results for the latest time rather than the older events.
I ran this search:
index=x | eval sorttime=strptime('time',"%m/%d/%Y %H:%M:%S%p") | sort -sorttime | dedup hostname compName +time keepempty=true | xyseries hostname compName status
This should return the results for the latest week/time, but instead it is showing data from an older week.