If I have an eventtype with a duration=X seconds field, I can chart the concurrency with eventtype=ABC | concurrency duration | timechart duration. How do I find out the total duration of all the events during any one hour window of time? (E.g., if an event started half way through and exceeded the window, only counting the part that didn't exceed the window. Likewise for events that started beforehand.) The end goal will be to calculate the max(SumDurationInHour) per day, along with the perc99(SumDurationInHour) per day.
The way I used to calculate this (via another logging system) was an event generated once every 5 seconds that listed the number of concurrent events at that point in time. I then aggregated and manually found the busiest window. Note that ideally, this isn't going to be based on a date_hour, but any one hour window. (I've created a separate question for that)
Background: I'm trying to write a report that will help with sizing phone lines, based on the Erlang B equation. The equation asks you to find the number of phone-hours (aka Erlangs) used on the busiest hour of the busiest day. This means that if 5 people got on an hour long conference call at 3:35 PM, but no one else used the phone, the busiest hour would be from 3:35 to 4:35 and use 5 phone-hours. Likewise, if in the middle of that call, another conference call with 4 people started at 4:05 and went through 5:00, our busiest hour would still be from 3:35 to 4:35, but we would have used 7 phone-hours.
I see why you've been going down the concurrency road... In search of the accurate tracking with respect to hour boundaries.
1) Something like this is all I can think of but it doesnt do what you want either:
<your event search> | concurrency duration=duration | delta _time as timeDelta | eval blockOfTime=concurrency*timeDelta | stats sum(blockOfTime) by date_hour
2) The following search is more straightforward but it doesnt account for calls that began in one hour and ended in another hour either. (Instead calls will always be counted against the hour within which they ended.)
<your event search> | stats sum(duration) as totalDuration by date_month, date_mday, date_hour | sort - totalDuration + date_month + date_mday + date_hour | stats first(totalDuration) as highestHourTotalDuration first(date_hour) as highestHour by date_month date_mday
In english, the first stats clause gives you the total duration for every hour of every day. Then the sort clause sorts within each day such that the hours are actually ordered by duration. Then the second stats clause is taking the first totalDuration value and the first date_hour value for each day. Because of the sort those will be the highest hours.
Maybe someone can see the missing ingredient to add to the concurrency + delta + stats solution above...
I’d considered going that route, and may have to, but I’m concerned about the inaccuracy of having calls accounted for in the incorrect hour, and hours starting on a fixed schedule, as you mentioned. The reason is that the Erlang B equation utilizes this value and the organizational tolerance for risk of blocking to determine line count. If an organization tends to have an after-lunch bump of long calls that splits between two hours, they would be exposed to unexpected risk. It might not be possible, but I’d prefer to get 100% accurate details within Splunk. Do you know if that’s possible?
I understand it better now. I updated my answer but I still dont have something that gives you the full accuracy. Maybe a bit more percolation and it'll become clearer again. 😃
Delta... very interesting. One other route that I'd considered going down was to have an eval that setup an array for year+day+hour, and then splitting phone calls into the hours they covered, adding them to the array, and then doing statistics off that. But after googling, I'm left with the very strong suspicion that this isn't possible, and violates the spirit of Splunk. I'll play around with what you added tomorrow. Thanks!
Hmm. I ran the first query you provided, and it overestimated by nearly 100%. Upon reflection, I'm thinking that's because an event is only issued once per call, and so concurrency only exists once per call. So the delta, in this case, is the difference between when one phone call starts and the next one starts, even if the first call ends 10 minutes earlier. (Event timestamp = start of the call.) If that's right, I could inject "call ended" events, but that's exceedingly ugly, and we're still left with the "on the hour" versus "for any hour" issue. Does that all make sense?
So with the help of Nick, the first part of the question is now working, although the only way I could see to implement it was somewhat messy. Instead of looking at the problem from the perspective of deriving the timing from the concurrency, I'm adding the concurrency to the timing. Specifically, joining the call logs to an eventset that has a single event every second. Given that, we can know exactly how many call-seconds are used every actual second, and then summarize up. (Nick's insight)
The messy part (my contribution) is that the only way I could see how to get a constant stream of events for each second was to fabricate a new log file, with a fillerevent for each second (aka, a file with 86400 events/day). It would be better if there were a way of doing this without this ugliness, but it does get the job done.
Here is the search I'm using:
index=GenericEvents sourcetype=GenericEvents earliest=10/25/2010:05:0:0 latest=10/25/2010:23:0:0
| eval SecondOfDay=date_hour*60*60+date_minute*60+date_second
| bucket _time span=1s
| stats count as GenericEvents by SecondOfDay
| eval duration=1
[search index=cisco_cdr duration>0 earliest=10/25/2010:05:0:0 latest=10/25/2010:23:0:0
| eval SecondOfDay=date_hour*60*60+date_minute*60+date_second
| fields duration pkid SecondOfDay
| concurrency duration=duration start=SecondOfDay
| delta SecondOfDay as deltaTime
| stats max(concurrency) as MaxConcurrency max(deltaTime) as MaxDelta by SecondOfDay
| eval TimeInUse=MaxDelta*(MaxConcurrency-1)
| stats sum(TimeInUse)
Stepping through that, we pull out the SecondOfDay (e.g., Second #82487), set a duration for the GenericEvents ID, then add in the actual data (subsearch -- again using SecondOfDay). Then I run concurrency, summarize to one event per second (fields: max concurrency), find the delta between events (should always be 1, but this way I can change the frequency at will, and it won't skew the data), multiple the delta by the concurrency to find out how many seconds of use there are for each event, and then summarize from there.
This definitely feels silly complex, but with my relatively small data set, it does produce accurate results. Note that this doesn't solve the second part of my question (rolling averages) -- I'll open a separate Splunk Answers for that, so that this particular question doesn't get too lengthy.