I have multiple log sources that are appended to on a daily basis. All rows in one refresh have the same epoch time. I would like to select all values from each log source based on the last epoch time.
I get the last collection epoch with the following stats:
sourcetype=my_sourcetype | fields epochtime, source | stats max(epochtime) as lasttime by source
This gives me a list of epoch values, one per source.
Next, I am trying to filter based on the above as follows:
sourcetype=my_sourcetype |rename epochtime as lasttime | join type=inner lasttime, source[ search sourcetype=my_sourcetype | fields epochtime, source | stats max(epochtime) as lasttime by source ]
I expected it to filter like a SQL join filter, where a.lasttime = b.lasttime and a.source = b.source, but instead I seem to get multiple events from the main query for each epoch.
How do I change this query to filter based on the results in stats?
Okay, so first, go read this, so you understand how to think about Splunk queries. Thinking in terms of SQL will result in extremely inefficient code in Splunk, every time.
Second, if you just want the last event of each source, then use this and you are done...
sourcetype=my_sourcetype | dedup source
dedup picks the first record of each combination of keys. In this case, the only key given is source, so it picks the first record of each source.
By default, Splunk returns the most recent records first. Thus, the most recent of each will be chosen.
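Note that dedup keeps the first event in whatever order the search returns them. If there is any chance your events are not in reverse time order (for example, after an earlier transforming command), you can force the order explicitly; this defensive variant is not required for a plain search:

sourcetype=my_sourcetype | sort 0 - _time | dedup source

The 0 after sort removes the default result limit, so no events are dropped before dedup runs.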
Third, if there is a chance that some source has stopped being sent, so the most recent load contains no events for it, and you want to get rid of that stale source because its date doesn't match, then do this...
sourcetype=my_sourcetype | dedup source | eventstats max(epochtime) as lasttime | where epochtime = lasttime
eventstats is like stats, except it does not delete the underlying events; it just copies the results onto every event that matches the keys. In this case, we don't need any key, because all the events from the most recent load have the same epochtime value. If any remaining event does NOT have the same value, then it means it is a source that is left over from a prior load.
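As an aside, if you instead wanted the last load of each source individually, whether or not it matches the overall most recent load, you could give eventstats a key. This keyed variant is a sketch, not part of the method above:

sourcetype=my_sourcetype | eventstats max(epochtime) as lasttime by source | where epochtime = lasttime

With the by source key, each event is compared against its own source's maximum epochtime, which is closer to what the join in the question was attempting.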
Fourth, here's a free bonus method that will sometimes (or even often) be more efficient than the third method...
sourcetype=my_sourcetype | stats latest(epochtime) as epochtime latest(someotherfield) as someotherfield by source | eventstats max(epochtime) as lasttime | where epochtime = lasttime
Explanation: within a stats, latest(x) will pick the value of x from the record with the highest value for _time.
Stats can be more efficient in some cases because it can be partially calculated at the indexers, limiting the amount of data that gets transferred to the search head.
Test this version against the third version and inspect the results to see which one is more efficient in your particular situation.
Thank you for the answer with an alternate solution.
However, I was looking for help on join and why the join stated in the question does not work. Is a join on more than one field not supported? Or is something else incorrect in the way I am joining the results?
@saboobaker - I gave you all those because this structure is extremely inefficient. I don't see any issues with the theory, though.
First, put white space between the source and the [ of the subsearch, just in case the subsearch does not return a starting space. If that works, you are done.
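In other words, the query from the question would become the following, where only the space before the bracket changes:

sourcetype=my_sourcetype | rename epochtime as lasttime | join type=inner lasttime, source [ search sourcetype=my_sourcetype | fields epochtime, source | stats max(epochtime) as lasttime by source ]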
Second, check to see whether you are running out of time or space. That should not be the case, since the subsearch is only returning two fields. Nonetheless, it is a theoretical possibility. Look at the job inspector and look for language about the job being complete or incomplete, and how many events were scanned and how many returned. If that looks normal, then check the search log to see what happened there.