Suppose I have log data like this:
2016-08-24 03:46:15 GMT vehicle_id="1075" vehicle_distance=145 stop_tag="5687"
...
2016-08-24 03:46:52 GMT vehicle_id="1075" vehicle_distance=19 stop_tag="5687"
...
2016-08-24 03:47:38 GMT vehicle_id="1075" vehicle_distance=47 stop_tag="5687"
For a given vehicle, the log shows its distance to the closest stop over time, as data is transmitted occasionally by the vehicle. I want to select only those log entries where, for a particular vehicle/stop pair, the distance is the smallest. For the sample data above, I'd want only the middle entry, because its distance of 19 is the smallest among 145, 19, and 47.
Note that the "..." above means that there are other log entries intermixed having different vehicles and their closest stops.
I want this "smallest distance" log entry every time the vehicle passes by the stop. For example, if the vehicle passes by the stop again two hours later (because it's done a complete loop of its route), I still want the "smallest distance" from the first time it passed by and the "smallest distance" from the second time it passed by, and so on for every pass. The multiple "clusters" of log entries for a vehicle/stop pair should be considered independently when finding each cluster's smallest distance.
BTW: a "cluster" of log entries for a given vehicle/stop pair could have any number of entries (but probably small, i.e., less than, say, 5). If there's only one entry, obviously that's the one with the smallest distance.
How can I do this?
Here's a rough, crude approach:
... | streamstats current=f global=f window=2 first(vehicle_distance) as vehicle_distance_1 last(vehicle_distance) as vehicle_distance_2 by vehicle_id stop_tag | where vehicle_distance_2 <= vehicle_distance_1 AND vehicle_distance_2 <= vehicle_distance
That's assuming each vehicle/stop/pass triple has a distinct trough of distance. The search copies the distances from the two preceding events for the vehicle/stop pair, and keeps only the events where the immediately preceding distance is a trough. Note that the event you're actually looking for is the one next to the event that survives the where clause, so you may need to copy over additional values (such as the timestamp) if you need more than the vehicle, stop, and shortest distance.
Getting the exact event is more work in Splunk, because you'd have to do | streamstats ... | reverse | streamstats ... to copy over one value from the neighbouring events on both sides.
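The double-streamstats trick is really just a local-minimum test. Strictly as an illustration (plain Python rather than SPL, with a hypothetical find_troughs helper), the check that the two passes perform together looks like this:

```python
def find_troughs(distances):
    """Return indices of local minima in a list of distances for one
    vehicle/stop pair, in time order.  An event is a trough if it is
    <= both neighbours; boundary events are compared to their single
    neighbour (a lone event is trivially a trough)."""
    troughs = []
    for i, d in enumerate(distances):
        prev_ok = i == 0 or distances[i - 1] >= d            # forward pass
        next_ok = i == len(distances) - 1 or distances[i + 1] >= d  # reverse pass
        if prev_ok and next_ok:
            troughs.append(i)
    return troughs

# The sample cluster from the question: 145, 19, 47 -> the middle entry.
```

The first streamstats corresponds to the prev_ok check, and the streamstats after the reverse corresponds to the next_ok check.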
Index all the data, then run your custom Python command over it at search time.
I'm sure there are plenty of native Splunk ways to do this, e.g. streamstats with some fancy resetting, but I don't understand enough of your requirements to continue.
For example, you mentioned a timeout of some kind, yet I don't see anything about that in your pseudocode.
Native doesn't necessarily mean better; do go down the custom-command route if you can express your problem in Python.
Again, I could; but, as I said, then I wouldn't have all the original data indexed just in case I ever need it for anything.
Also, it would be nice to learn more SPL to know how to do this in Splunk directly.
If the assumption doesn't hold, then anything relying on it won't work.
I'd question the data's usefulness then, though. In my mind, the minimum amount of data needed to determine a closest distance would be three readings per pass:
1. one reading while the vehicle approaches the stop,
2. one reading at the closest approach, and
3. one reading while the vehicle moves away again.
The second would be the trough of distance, and the one you'd be looking for. If you don't even have this minimum amount of data, you should probably post an anonymized full sample set.
BTW: I could write an external program to filter the results. The algorithm in Python is:
log_d = {}                           # vehicle_id -> best entry in the current cluster
for line in args.in_file:
    new_d = parse(line)              # parse into timestamp & key/value pairs
    vid = new_d[K_VID]
    if vid not in log_d:
        log_d[vid] = new_d
    else:
        old_d = log_d[vid]
        if new_d[K_STAG] == old_d[K_STAG]:
            # same stop: keep whichever entry has the smaller distance
            if new_d[K_VDISTANCE] < old_d[K_VDISTANCE]:
                log_d[vid] = new_d
        else:
            # stop changed: the old cluster is done, so emit its minimum
            log_vehicle_at_stop(old_d, args.out_file)
            log_d[vid] = new_d
# emit whatever clusters remain when the input ends
for old_d in log_d.values():
    log_vehicle_at_stop(old_d, args.out_file)
But I'd rather have all the data inside Splunk and use Splunk to filter it if possible.
Or I could take the Python program and make a custom streaming search command out of it. But doing it "native" would probably still be better.
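For what it's worth, the min-per-cluster logic can be made self-contained and runnable outside Splunk. This is only a sketch: it uses the field names from the sample log, takes events as already-parsed dicts rather than raw lines, and the function name min_per_cluster is made up for illustration.

```python
K_VID, K_STAG, K_VDIST = "vehicle_id", "stop_tag", "vehicle_distance"

def min_per_cluster(events):
    """Yield the smallest-distance event from each vehicle/stop cluster.

    A cluster ends when the vehicle's closest stop changes; whatever is
    still pending when the input ends is flushed at the end.
    """
    last = {}                      # vehicle_id -> best event in current cluster
    for ev in events:
        vid = ev[K_VID]
        if vid in last and last[vid][K_STAG] == ev[K_STAG]:
            if ev[K_VDIST] < last[vid][K_VDIST]:
                last[vid] = ev     # new minimum within the same cluster
        else:
            if vid in last:
                yield last[vid]    # cluster ended: emit its minimum
            last[vid] = ev         # start a new cluster
    yield from last.values()       # flush the final cluster of each vehicle

# The three sample entries from the question, as parsed events:
sample = [
    {K_VID: "1075", K_STAG: "5687", K_VDIST: 145},
    {K_VID: "1075", K_STAG: "5687", K_VDIST: 19},
    {K_VID: "1075", K_STAG: "5687", K_VDIST: 47},
]
```

On the sample, only the entry with distance 19 comes out; add a later event for a different stop and the first cluster's minimum is emitted as soon as the stop changes.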
The data for your #3 may not be there, because the vehicles report their GPS location only occasionally. It could easily be the case that by the next time the vehicle reports its location, it's actually closer to its next stop.
It could also happen the other way around, i.e., the first time the vehicle reports its closest stop, it could be, say, 20 feet away; but the next time it reports its closest stop, it could be, say, 55 feet away.
If by "trough" you mean a distance pattern of I, J, K where J is less than either I or K, then no. There could be any (small) number of log entries for a vehicle/stop pair. I've updated my original question to reflect this. Will your solution still work?
The toughest problem is knowing when to reset; that is, knowing when the car has passed a stop so you can start tracking the next pass. In your example, how do you know that all three values are from the same pass, rather than the first being the closest on a first pass, the middle the closest on a second pass, and the third the closest on a third pass?
There are a couple of ways: (1) the log would contain the vehicle at other stops in the interim; (2) the timestamp of separate clusters would be far apart in time.
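Way (2) is straightforward to sketch. Assuming the readings for one vehicle/stop pair are already sorted by time, and picking an arbitrary 10-minute timeout (the real threshold would depend on the route; both helper names are made up), the per-pass minimum could be computed like this:

```python
from datetime import datetime, timedelta

def split_by_gap(readings, gap=timedelta(minutes=10)):
    """readings: (timestamp, distance) tuples for one vehicle/stop pair,
    sorted by time.  Start a new cluster whenever the gap from the
    previous reading exceeds `gap`."""
    clusters = []
    for ts, dist in readings:
        if clusters and ts - clusters[-1][-1][0] <= gap:
            clusters[-1].append((ts, dist))    # same pass
        else:
            clusters.append([(ts, dist)])      # new pass
    return clusters

def closest_per_pass(readings, gap=timedelta(minutes=10)):
    """Smallest distance from each pass, in time order."""
    return [min(d for _, d in cluster) for cluster in split_by_gap(readings, gap)]
```

With the sample readings plus a second pass two hours later, this yields one minimum per pass instead of one overall minimum.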