Solved: Distinct count at any given time in a day

plucas_splunk · ‎08-30-2016

A particular public transit line is served by, say, N vehicles concurrently at any given time in the range [0,M] where M is the maximum number that ever serve the line concurrently. From the time the line starts, say 6am, vehicles come onto the line one by one until they reach M. Towards the end, say 9pm, vehicles start to leave the line one by one until N again reaches 0. Graphically, it could look like:

AAAAAAAAAAAAAAAAA
   BBBBBBBBBBBBB
        CCCCCCCCCCCCCCCC

Time progresses from left-to-right. Vehicle "A" is the first to go into service and, in the time before "B" does, N is 1. Once "B" goes into service, N increases to 2; and to 3 for "C". At some later time, "B" is the first to leave service, so N decreases to 2; and so on until N once again reaches 0.

The actual log data looks like:

2016-08-29 17:39:59 GMT vehicle_id="1007" speed=12 distance=192 stop="5240"
2016-08-29 17:40:26 GMT vehicle_id="1008" speed=16 distance=246 stop="4504"
2016-08-29 17:40:50 GMT vehicle_id="1015" speed=0 distance=125 stop="35184"
2016-08-29 17:41:06 GMT vehicle_id="1009" speed=19 distance=280 stop="4530"
2016-08-29 17:41:48 GMT vehicle_id="1008" speed=12 distance=66 stop="4534"
2016-08-29 17:41:51 GMT vehicle_id="1006" speed=14 distance=210 stop="5240"
2016-08-29 17:42:11 GMT vehicle_id="1009" speed=15 distance=272 stop="5174"
2016-08-29 17:42:09 GMT vehicle_id="1006" speed=0 distance=97 stop="5240"
2016-08-29 17:42:32 GMT vehicle_id="1009" speed=4 distance=105 stop="5174"
2016-08-29 17:42:21 GMT vehicle_id="1015" speed=0 distance=105 stop="35184"

Note that it's possible for a vehicle to be temporarily taken out of service mid-day. For example, if "B" were taken out of service temporarily, then it would look like:

BBBBB       BBBB

And N would decrease by 1 during the gap. In order for a gap to be considered a gap, there would be a gap in time when there are no log entries from "B". How long the gap has to be in order to be considered a gap should ideally be tunable, e.g., 30 minutes.

I want to plot N (the number of vehicles concurrently on the line) over time from the start of service to the end per day. How can I do that?

sundareshr · ‎08-30-2016

What does the value for stop represent? Would it be accurate to assume duration=distance/speed? Regardless, here's a thought...

base search | eval duration=distance/speed | eval starttime=_time | eval endtime=starttime+duration | eval per_hr=mvrange(starttime, endtime, "1h") | mvexpand per_hr | eval per_hr=strftime(per_hr, "%H") | stats dc(vehicle_id) as vehicles by per_hr

*UPDATED*

... | timechart span=G dc(vehicle_id) as vehicles

View solution in original post

tin_fish · ‎08-31-2016

Not writing out queries at this stage, but I think the only way you could really get 'N' is to apply some knowledge to the query in relation to the stop value. By this I mean, unless it is reflected in some way in the data (not described) you would have to know when a vehicle should reasonably expect to have reported at it's subsequent stop. So the most recent timestamp is valid unless that timestamp exceeds timestamp + maximum journey time to stop+1. If it exceeds the value it must be assumed it has been removed from service.

sundareshr · ‎08-30-2016

What does the value for stop represent? Would it be accurate to assume duration=distance/speed? Regardless, here's a thought...

base search | eval duration=distance/speed | eval starttime=_time | eval endtime=starttime+duration | eval per_hr=mvrange(starttime, endtime, "1h") | mvexpand per_hr | eval per_hr=strftime(per_hr, "%H") | stats dc(vehicle_id) as vehicles by per_hr

*UPDATED*

... | timechart span=G dc(vehicle_id) as vehicles

plucas_splunk · ‎08-30-2016

What does the value for stop represent?

stop is an arbitrary identifier for a physical bus stop.

Would it be accurate to assume duration=distance/speed?

No. distance is how many feet the vehicle is from its physically closest stop (that may either be the stop that it just past by or the stop it's approaching, whichever is closer); speed is the vehicle's speed at the time the GPS transponder on board the vehicle transmitted its data to be logged. Hence duration as you've written it is incorrect.

I included distance, speed, and stop for completeness, but I can't see any way they could possibly be part of a correct answer.

I would think the only things that matter are the timestamp and the vehicle_id. For example, if there is a log entry at T1 for vehicle "A" and also again at T2 for "A" and T2 – T1 <= G (where G is a tunable "gap" threshold), then we will assume that vehicle "A" was in service for the entire time between T1 and T2.

sundareshr · ‎08-30-2016

So if you set the span for timechart to be >=G, a dc would give you # vehicles on the road as (may need to filter our speed=0). In other words, why not

speed>0 | timechart span=G dc(vehicle_id) as vehicles

plucas_splunk · ‎08-30-2016

Vehicles should not be filtered based on their speed. If a vehicle is idling at either a bus stop (allowing passengers to board or alight) or at a red light, its speed will be 0, but it is still "in service" serving the line.

That aside, your timechart yields results that look like I would expect them to look. Thanks!

sundareshr · ‎08-30-2016

Awesome! Remember to accept the answer to close out this question

plucas_splunk · ‎08-30-2016

If there'a any way for you to convert your timechart comment into the actual answer (rather than a comment on the previous answer), that would benefit future readers.

sundareshr · ‎08-31-2016

I've updated the original answer

Distinct count at any given time in a day

Index This | What is broken 80% of the time by February?

Unlock Faster Time-to-Value on Edge and Ingest Processor with New SPL2 Pipeline ...

Splunk MCP & Agentic AI: Machine Data Without Limits

Join the Conversation

Distinct count at any given time in a day

Index This | What is broken 80% of the time by February?

Unlock Faster Time-to-Value on Edge and Ingest Processor with New SPL2 Pipeline ...

Splunk MCP & Agentic AI: Machine Data Without Limits