Selecting log entry having smallest field value

plucas_splunk · ‎08-23-2016

Suppose I have log data like this:

2016-08-24 03:46:15 GMT vehicle_id="1075" vehicle_distance=145 stop_tag="5687"
...
2016-08-24 03:46:52 GMT vehicle_id="1075" vehicle_distance=19 stop_tag="5687"
...
2016-08-24 03:47:38 GMT vehicle_id="1075" vehicle_distance=47 stop_tag="5687"

For a given vehicle, it shows its distance to the closest stop over time as data is transmitted occasionally by the vehicle. I want to select only those log entries where, for a particular vehicle/stop pair, the distance is the smallest. For this sample data, I'd want only the middle entry because it has a distance of 19 that is the smallest among 19, 47, and 145.

Note that the "..." above means that there are other log entries intermixed having different vehicles and their closest stops.

I want this "smallest distance" log entry every time the vehicle passes by the stop. For example, if the vehicle passed by the stop again two hours later (because it's done a complete loop of its route), I still want the "smallest distance" from the first time it passed by and the "smallest distance" from the second time it passed by --- and so on for every time it passes by the stop. The multiple "clusters" of vehicle/stop pairs should be considered independently when finding each "cluster's" smallest distance.

BTW: a "cluster" of log entries for a given vehicle/stop pair could have any number of entries (but probably small, i.e., less than, say, 5). If there's only one entry, obviously that's the one with the smallest distance.

How can I do this?

martin_mueller · ‎08-23-2016

Here's a rough crude approach:

... | streamstats current=f global=f window=2 first(vehicle_distance) as vehicle_distance_1 last(vehicle_distance) as vehicle_distance_2 by vehicle_id stop_tag | where vehicle_distance_2 <= vehicle_distance_1 AND vehicle_distance_2 <= vehicle_distance

That's assuming each vehicle-stop-instance triple has a distinct trough of distance. This will copy over two distances from two adjacent events for the vehicle-stop pair, and only keep those with that trough. The event you're actually looking for is the one next to this event, you may need to copy over additional values such as time if you need more than vehicle, stop, and shortest distance.
Getting the exact event is more work for Splunk because you'd have to do | streamstats ... | reverse | streamstats ... to copy over one value from neighbouring events on both sides.

martin_mueller · ‎08-25-2016

Index all the data, then run your custom python command over it at search time.

I'm sure there are plenty Splunky ways to do this natively, e.g. streamstats with some fancy resetting, but I don't understand enough of your requirements to continue.
For example, you mentioned a timeout kinda thing... yet, I don't see anything about that in your pseudocode.

martin_mueller · ‎08-25-2016

Native doesn't necessarily mean better, do go along the custom command route if you can express your problem in python.

plucas_splunk · ‎08-25-2016

Again, I could; but, as I said, then I wouldn't have all the original data indexed just in case I ever need it for anything.

Also, it would be nice to learn more SPL to know how to do this in Splunk directly.

martin_mueller · ‎08-24-2016

If the assumption doesn't hold then all things relying on the assumption won't work.

I'd question the data's usefulness then though. In my mind, the minimum amount of data to determine a closest distance would be:

vehicle is some distance away
vehicle is even closer
vehicle is further away again

The second would be the trough of distance, and the one you'd be looking for. If you don't even have this minimum amount of data you should probably post up an anonymized full sample set.

plucas_splunk · ‎08-25-2016

BTW: I could write an external program to filter the results. The algorithm in Python is:

log_d = { }
for line in args.in_file:
    new_d = parse( line ) # parse into timestamp & key/value pairs
    vid = new_d[ K_VID ]
    if vid not in log_d:
        log_d[ vid ] = new_d
    else:
        old_d = log_d[ vid ]
        old_stop = old_d[ K_STAG ]
        new_stop = new_d[ K_STAG ]
        if new_stop == old_stop:
            if new_d[ K_VDISTANCE ] < old_d[ K_VDISTANCE ]:
                log_d[ K_VID ] = new_d
        else:
            log_vehicle_at_stop( old_d, args.out_file )
            log_d[ vid ] = new_d

But I'd rather have all the data inside Splunk and use Splunk to filter it if possible.

Or I could take the Python program and make a custom streaming search command out of it. But doing it "native" would probably still be better.

plucas_splunk · ‎08-24-2016

The data for your #3 may not be there because the vehicles report their GPS location only occasionally. It could easily be the case that the next time the vehicle reports its location that it's actually closer to it's next stop.

It could also happen the other way around, i.e., the first time the vehicle reports its closest stop, it could be, say, 20 feet away; but the next time it reports its closest stop, it could be, say, 55 feet away.

plucas_splunk · ‎08-24-2016

If by "trough" you mean a distance pattern of I, J, K where J is less than either I or K, then no. There could be any (small) number of log entries for a vehicle/stop pair. I've updated my original question to reflect this. Will your solution still work?

mIliofotou_splu · ‎08-23-2016

The toughest problem is to know when to reset; that is to know when the car passes by a stop and then you need to start tracking of the next pass. In your example, how do you know that these three values are from the "same time it passed by" and not the first one being "the closest on the first time", the middle one being the "closest on the second time" and the third one to be "the closes on the third time"?

plucas_splunk · ‎08-24-2016

There are a couple of ways: (1) the log would contain the vehicle at other stops in the interim; (2) the timestamp of separate clusters would be far apart in time.

Selecting log entry having smallest field value

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Value Insights: Now Generally Available in the CMC

What’s New in Splunk AI: Volume 02

Splunk App Dev Quarterly Roundup: AI, Agents, and Innovation!

Join the Conversation

Selecting log entry having smallest field value

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Value Insights: Now Generally Available in the CMC

What’s New in Splunk AI: Volume 02

Splunk App Dev Quarterly Roundup: AI, Agents, and Innovation!