My question is similar to this: https://answers.splunk.com/answers/35759/keping-only-most-recent-events-for-a-fixed-field.html
Basically, I have scan data that looks something like this:
scanIDa,machine1,fail1
scanIDa,machine1,fail2
scanIDb,machine1,fail1
scanIDb,machine1,fail2
scanIDb,machine1,fail3
scanIDc,machine2,fail1
scanIDc,machine2,fail2
scanIDc,machine2,fail3
scanIDd,machine2,fail1
scanIDd,machine2,fail3
scanIDe,machine3,fail1
scanIDf,machine3,fail1
scanIDf,machine3,fail2
I want to keep all the data for only the most recent scan on each machine. So the end result of my search should be something like this:
scanIDa,machine1,fail1
scanIDa,machine1,fail2
scanIDc,machine2,fail1
scanIDc,machine2,fail2
scanIDc,machine2,fail3
scanIDe,machine3,fail1
I don't want to know about fail3 on machine1 anymore because it was fixed in a more recent scan.
scanID is a random value. Looks like an md5 hash or something. Whatever it is, it's not usable as a sort field.
Is this possible? Am I dreaming?
The dedup command should do what you want.
yoursearchhere | dedup scanID machineID
dedup preserves the first event it sees for each unique combination of scanID and machineID fields. Since Splunk returns events in reverse time order (newest first), the search results will contain only the most recent event.
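To make the behaviour described above concrete, here is a rough Python simulation of "keep the first event per unique field combination" run on the sample data from the question. It is only a sketch of the logic, not how Splunk implements dedup; the `dedup` helper and the tuple layout are invented for illustration, and the events are assumed to arrive newest first.

```python
# Sample events from the question as (scanID, machine, fail),
# assumed to be ordered newest first.
EVENTS = [
    ("scanIDa", "machine1", "fail1"),
    ("scanIDa", "machine1", "fail2"),
    ("scanIDb", "machine1", "fail1"),
    ("scanIDb", "machine1", "fail2"),
    ("scanIDb", "machine1", "fail3"),
    ("scanIDc", "machine2", "fail1"),
    ("scanIDc", "machine2", "fail2"),
    ("scanIDc", "machine2", "fail3"),
    ("scanIDd", "machine2", "fail1"),
    ("scanIDd", "machine2", "fail3"),
    ("scanIDe", "machine3", "fail1"),
    ("scanIDf", "machine3", "fail1"),
    ("scanIDf", "machine3", "fail2"),
]


def dedup(events, key_indexes):
    """Keep only the first event seen for each combination of key fields."""
    seen = set()
    kept = []
    for event in events:
        key = tuple(event[i] for i in key_indexes)
        if key not in seen:
            seen.add(key)
            kept.append(event)
    return kept


# Hypothetical equivalent of "dedup scanID machine": keys are fields 0 and 1.
deduped = dedup(EVENTS, key_indexes=(0, 1))
for event in deduped:
    print(",".join(event))
```

On this data the simulation keeps one event per scan (six in total), so the other failures reported by the same scan are dropped along with the older scans.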
There are other ways to approach this as well. The following search may give you more ideas...
yoursearchhere | stats latest(status) as status list(scanID) as scanIDs dc(scanID) as NumberofScans by machineID
I don't think that dedup command will work... It'll only keep the first event for every unique combination of scanID and machineID. I want to keep ALL events for the most recent scanID on every machine. In my example result set, wouldn't that throw these out?
I'll look into the stats line though, thanks!
@lguinn's answer helped me out a little, but it still didn't get me exactly what I wanted.
What I ended up doing was creating a lookup that is rebuilt once an hour by this search:
searchhere | stats latest(scanID) as scanID by machine | eval mostRecent="yes"
Then I can just search for mostRecent="yes" and get the results I want.
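The two-step idea (first work out the latest scanID per machine, then match events against that table) can be sketched in Python as below. This is a simulation under assumptions, not Splunk code: the timestamps are invented for illustration (larger means more recent), and `build_lookup` and `most_recent_only` are hypothetical helper names.

```python
# Hypothetical events as (time, scanID, machine, fail).
# The time values are made up; the original data does not show them.
EVENTS = [
    (200, "scanIDa", "machine1", "fail1"),
    (200, "scanIDa", "machine1", "fail2"),
    (100, "scanIDb", "machine1", "fail1"),
    (100, "scanIDb", "machine1", "fail3"),
    (300, "scanIDc", "machine2", "fail1"),
    (150, "scanIDd", "machine2", "fail3"),
]


def build_lookup(events):
    """Step 1 (the hourly job): map each machine to its latest scanID."""
    latest = {}  # machine -> (time, scanID)
    for time, scan_id, machine, _fail in events:
        if machine not in latest or time > latest[machine][0]:
            latest[machine] = (time, scan_id)
    return {machine: scan_id for machine, (_time, scan_id) in latest.items()}


def most_recent_only(events, lookup):
    """Step 2 (search time): keep events whose scanID matches the lookup."""
    return [e for e in events if lookup.get(e[2]) == e[1]]


lookup = build_lookup(EVENTS)
recent_events = most_recent_only(EVENTS, lookup)
for event in recent_events:
    print(event)
```

Because the table is only rebuilt periodically, results can lag behind by up to one refresh interval if a new scan arrives in between.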
I had the same issue with my Nessus scan data. I solved it using the streamstats command. The following search works for your example:
yoursearch | streamstats first(scanID) as scanID_first by machine | eval recent=if(scanID=scanID_first,"yes","no")
This will make your scan data look as follows:
scanID   machine   fail   scanID_first  recent
scanIDa  machine1  fail1  scanIDa       yes
scanIDa  machine1  fail2  scanIDa       yes
scanIDb  machine1  fail1  scanIDa       no
scanIDb  machine1  fail2  scanIDa       no
scanIDb  machine1  fail3  scanIDa       no
scanIDc  machine2  fail1  scanIDc       yes
scanIDc  machine2  fail2  scanIDc       yes
scanIDc  machine2  fail3  scanIDc       yes
scanIDd  machine2  fail1  scanIDc       no
scanIDd  machine2  fail3  scanIDc       no
scanIDe  machine3  fail1  scanIDe       yes
scanIDf  machine3  fail1  scanIDe       no
scanIDf  machine3  fail2  scanIDe       no
Now you can search for recent="yes".
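For readers who want to check the logic outside Splunk, the single-pass idea can be simulated in Python: remember the first scanID seen for each machine and tag every event by comparing against it. This is a sketch only; `tag_recent` is a hypothetical helper, and it assumes the events arrive newest first, as they do in the example.

```python
# Sample events from the question as (scanID, machine, fail),
# assumed to be ordered newest first.
EVENTS = [
    ("scanIDa", "machine1", "fail1"),
    ("scanIDa", "machine1", "fail2"),
    ("scanIDb", "machine1", "fail1"),
    ("scanIDb", "machine1", "fail2"),
    ("scanIDb", "machine1", "fail3"),
    ("scanIDc", "machine2", "fail1"),
    ("scanIDc", "machine2", "fail2"),
    ("scanIDc", "machine2", "fail3"),
    ("scanIDd", "machine2", "fail1"),
    ("scanIDd", "machine2", "fail3"),
    ("scanIDe", "machine3", "fail1"),
    ("scanIDf", "machine3", "fail1"),
    ("scanIDf", "machine3", "fail2"),
]


def tag_recent(events):
    """Add scanID_first and recent columns in one pass over the events."""
    first_scan = {}  # machine -> first scanID seen for that machine
    tagged = []
    for scan_id, machine, fail in events:
        # setdefault stores scan_id only the first time a machine appears.
        first = first_scan.setdefault(machine, scan_id)
        recent = "yes" if scan_id == first else "no"
        tagged.append((scan_id, machine, fail, first, recent))
    return tagged


tagged = tag_recent(EVENTS)
# Equivalent of filtering on recent="yes", keeping the original three fields.
recent_events = [row[:3] for row in tagged if row[4] == "yes"]
for event in recent_events:
    print(",".join(event))
```

The filtered output matches the result set asked for in the question. The approach depends entirely on event order: if the events were not newest first, the first scanID seen per machine would not be the most recent one.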
My case was a little different. The name of the scan (the "name" field) did not change. However, the "scan_start" field (when the scan was started) was different for each scan run. I wanted to keep only the events from the scan with the latest value of scan_start. So I used this search:
mysearch | streamstats first(scan_start) as scan_start_first by name | eval recent=if(scan_start=scan_start_first,"yes","no")