Hello!
Events I'm analyzing represent sites in a down state. Each event has SITENAME, EVENTTIME and CLRTIME, representing the site, the time it went down, and the time the outage ceased.
I need to create the daily availability report.
I'm using a BIN command to have a table per day. The problem appears when an event is longer than 24 hours: in such a situation, a day "in the middle" of the event has no site down event for the particular site, even though the availability of that site on that day is 0.
03/12------04/12------05/12
Event---------------------Ceasing
Here the event appears on 03/12 and lasts until 05/12. How can I extend the event (or split it, or create "dummy" events) so that 04/12 and 05/12 are covered as well?
Thanks!
NJ
Based on your comments, it's sufficient if the report covers downtime events only after they are cleared, meaning it won't show downtimes that began in the past and are not cleared yet. Since the search you provided uses fields I don't know about and I don't exactly understand what you're doing there, I'm going to calculate things from what I know and try to come up with what I think you mean. I've used this to mock some data:
| makeresults | eval clrtime = 1554768400, eventtime = 1554761200, sitename="foo"
| append [| makeresults | eval clrtime = 1554867600, eventtime = 1554761100, sitename="bar"]
| append [| makeresults | eval clrtime = 1554768400, eventtime = 1554636473, sitename="baz"]
From here, I've used join with the max=0 option to expand each event into multiple events (based on the count option of makeresults: change this to the number of days you want to show):
| join max=0 [| makeresults count=7 | streamstats count | eval time = relative_time(_time - (count - 1) * 86400, "@d") | fields time]
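To see what this does, here is the subsearch run standalone, with a readable day column added for inspection:
| makeresults count=7
| streamstats count
| eval time = relative_time(_time - (count - 1) * 86400, "@d")
| eval day = strftime(time, "%F")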
As you can see, the subsearch just creates a number of rows with days as epoch values in the field time. Now all we need to do is calculate the duration per day that a site is down:
| eval outage_secs_from = if(eventtime <= (time + 86400), min((time + 86400) - eventtime, 86400), 0)
| eval outage_secs_to = if(clrtime > time, min(clrtime - time, 86400), 0)
| eval outage_secs_total = min(outage_secs_from, outage_secs_to)
| fieldformat outage_total_duration = tostring(outage_secs_total, "duration")
| eval date = strftime(time, "%F")
This is the data you can now work with; here you'll see what will become your chart in a minute (hence the fieldformat outage_total_duration for readability). The idea of the two if/min evals is to clip each event to the boundaries of the day in question, i.e. to the intersection of [eventtime, clrtime] with [time, time + 86400]. The next step transforms your data (similar to the xyseries in your example):
| chart sum(outage_secs_total) by sitename date
This would give you a sum of the outage seconds per sitename and day (I've used strftime with %F here, but you can also use any other format string). I see you used some sort of count and mean by date and sitename in your example; right before the chart would be the place to do all kinds of calculations. If you wanted to have site availability as a percentage of seconds in a day, you would change those last lines to
| eval outage_secs_total = min(outage_secs_from, outage_secs_to)
| eval availability = round(100 * (86400 - outage_secs_total) / 86400, 2)
| eval date = strftime(time, "%F")
| chart avg(availability) by sitename date
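As a quick sanity check of that formula: an outage of 7,200 seconds (two hours) on one day gives round(100 * (86400 - 7200) / 86400, 2) = 91.67, i.e. 91.67% availability for that day.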
I hope this is going in the right direction. Feel free to come back with any follow up questions.
With this approach the event can only be expanded into a fixed number of new events (7 in your case, defined by count).
In my case I never know how long an event lasts; count would have to be calculated per event. Can this be fixed?
Sure, that's possible; we just have to change how we create our events per day. Replace the join command with these three lines:
| eventstats max(clrtime) as max_clrtime min(eventtime) as min_eventtime
| eval time = mvrange(relative_time(min_eventtime, "@d"), relative_time(max_clrtime, "@d+1d"), "1d")
| mvexpand time
This creates a multivalue field with mvrange, specifically its ability to create date ranges. It excludes the end value, which is why we're adding an additional day to that parameter. The boundaries for this mvrange are the earliest eventtime and the latest clrtime in your data.
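As a quick standalone illustration of what these three lines produce (a sketch over a hypothetical four-day window ending today):
| makeresults
| eval min_eventtime = relative_time(now(), "-3d@d"), max_clrtime = now()
| eval time = mvrange(relative_time(min_eventtime, "@d"), relative_time(max_clrtime, "@d+1d"), "1d")
| mvexpand time
| eval day = strftime(time, "%F")
This should give you four rows, one per day from three days ago through today.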
Be aware that this dynamic solution will give you columns for as many days as there are in your data, and that chart will create an OTHER column when it gets more than 10 values. Use chart limit=20 or some other value to increase this number.
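Applied to the chart from above, that would look like this (20 is an arbitrary choice):
| chart limit=20 sum(outage_secs_total) by sitename date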
Thanks,
but the above does not create new events, unless I'm missing something.
It does: mvexpand creates new events from multivalue fields. Maybe I was unclear about what to replace:
| makeresults | eval clrtime = 1554768400, eventtime = 1554761200, sitename="foo"
| append [| makeresults | eval clrtime = 1554867600, eventtime = 1554761100, sitename="bar"]
| append [| makeresults | eval clrtime = 1554768400, eventtime = 1554636473, sitename="baz"]
| eventstats max(clrtime) as max_clrtime min(eventtime) as min_eventtime
| eval time = mvrange(relative_time(min_eventtime, "@d"), relative_time(max_clrtime, "@d+1d"), "1d")
| mvexpand time
| eval outage_secs_from = if(eventtime <= (time + 86400), min((time + 86400) - eventtime, 86400), 0)
| eval outage_secs_to = if(clrtime > time, min(clrtime - time, 86400), 0)
| eval outage_secs_total = min(outage_secs_from, outage_secs_to)
| eval date = strftime(time, "%F")
| chart sum(outage_secs_total) by sitename date
jeffland,
I appreciate your help very much.
I had to add a single correction to your code: events that started and cleared on the same day were not calculated correctly (for such events, both clipped values are larger than the actual duration, so their minimum overestimates the outage). The following has fixed this problem:
| eval outage_secs_between = clrtime-eventtime
| eval outage_secs_total = min(outage_secs_from, outage_secs_to, outage_secs_between)
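For reference, this is the complete search with that correction slotted in (just the snippets from above, combined):
| makeresults | eval clrtime = 1554768400, eventtime = 1554761200, sitename="foo"
| append [| makeresults | eval clrtime = 1554867600, eventtime = 1554761100, sitename="bar"]
| append [| makeresults | eval clrtime = 1554768400, eventtime = 1554636473, sitename="baz"]
| eventstats max(clrtime) as max_clrtime min(eventtime) as min_eventtime
| eval time = mvrange(relative_time(min_eventtime, "@d"), relative_time(max_clrtime, "@d+1d"), "1d")
| mvexpand time
| eval outage_secs_from = if(eventtime <= (time + 86400), min((time + 86400) - eventtime, 86400), 0)
| eval outage_secs_to = if(clrtime > time, min(clrtime - time, 86400), 0)
| eval outage_secs_between = clrtime - eventtime
| eval outage_secs_total = min(outage_secs_from, outage_secs_to, outage_secs_between)
| eval date = strftime(time, "%F")
| chart sum(outage_secs_total) by sitename date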
Regards!
Oh, yes you're right - I forgot that case. Happy to help!
When are your events created, at CLRTIME? I assume EVENTTIME and CLRTIME are timestamps?
eventtime is the event creation time, clrtime the event clearance time. For a particular site down event, I actually have two messages. First, when the site goes down, there's an event with a specific sitename and eventtime; clrtime is empty at this moment. When the site comes back online, there's another message with the same sitename and eventtime as in the previous message, and clrtime is now not empty: it is the time the site came back online.
And is there a common id in the two events? Or can we otherwise assume that if a combination of site name and eventtime is found only twice, once with an empty clrtime field and once with that field filled, that the two events refer to the same site down event?
It would be easier to use the combination of sitename and eventtime. There is an ID, but it's not plain and needs a lot of regexing; if you think there's no other way, I'll work on it. Actually, we can use only the event with both eventtime and clrtime present and ignore the event with the empty clrtime field. I don't need all of this running in real time; it's OK to have the report when the event with the clrtime arrives.
So when it arrives, we'll have eventtime (start) and clrtime (event end).
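(In SPL, keeping only the completed events could be as simple as the following filter; a sketch, assuming clrtime is absent or empty until the clearing message arrives:)
| where isnotnull(clrtime) AND clrtime != ""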
Thanks,
NJ
Evaluate how you are calculating eventtime_u. You may need to adjust that so you have a globally unique event ID that occurs at both the start and the end of the event. Perhaps the transaction command (https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Transaction) could help.
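A sketch of that idea with the field names from this thread: transaction groups the start and clearing messages that share sitename and eventtime, and emits a duration field spanning the grouped events.
| transaction sitename eventtime
| eval outage_secs = duration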
eventtime_u = clrtime - eventtime; it's set through the "calculated fields" page in the Web GUI.
I may be able to extract the unique ID; it's there in the raw data. That could give me a way to fix the ceasing day, but it's still unclear what could be done about the day in the middle of an event.
I was thinking of the makeresults command, but I'm not sure.
Give us your existing search. How are we supposed to help with only a description?
I was looking for a hint or an idea. Anyway, here are all the details.
| search (SPECPB="*LOSS-OF-ALL-CHAN*")
| dedup EVTTIME SPECPB
| eval bintime_u=eventtime_u
| bin bintime_u span=1d
| stats list(sectorname) as sectorname list(eventtime) as eventtime list(clrtime) as clrtime list(eventduration_u) as eventduration_u sum(eventduration_u) as cumulative_sector_down count(eventduration_u) as cumulative_event_count by bintime_u sitename
| eval bintime=strftime(bintime_u,"%m-%d")
| eval mean_site_down=cumulative_sector_down/cumulative_event_count
| eval siteavailability_d=100-round(100*mean_site_down/86400,2)
| xyseries sitename,bintime,siteavailability_d
My events:
_time sitename sectorname eventtime clrtime SPECPB eventduration_u
So I'm getting correct results when an event lasts less than a day (i.e. it starts after 00:00 and ends before 24:00 of the same day). If the event lasts longer than one day, starting on the day before and ending on the day after, the day in between obviously has no event. I need to spread such events through the following bins somehow.
Regards,
NJ