Deployment Architecture

Event longer then bin - Availability report

Explorer

Hello!
Events I'm analyzing represent sites in down state. Each event has SITENAME, EVENTTIME and CLRTIME, obviously representing site down time and ceasing time.

I need to create the daily availability report.

I'm using a BIN command to have a table per day. The problem appears when an event is longer then 24 hours. In such situation a day "in the middle" of the event has no site down event for the particular site. However this day availability for this particular site is 0.

03/12------04/12------05/12
Event---------------------Ceasing

Here the event appears on 03/12 and lasts till 05/12. How can I extend (or split, or create "dummy" events) on 04/12 and 05/12?
Thanks!
NJ

0 Karma
1 Solution

Champion

Based on your comments, its sufficient if the report covers downtime events only after they are cleared, meaning this won't show you downtimes that have begun in the past and are not cleared yet. Since you're using fields in the search provided above that I don't know about and I don't exactly understand what you're doing there, I'm just going to calculate things from what I know and try to come up with what I think you mean. I've used this to mock some data:

| makeresults | eval clrtime = 1554768400, eventtime = 1554761200, sitename="foo"
| append [| makeresults | eval clrtime = 1554867600, eventtime = 1554761100, sitename="bar"]
| append [| makeresults | eval clrtime = 1554768400, eventtime = 1554636473, sitename="baz"]

From here, I've used join with the max=0 setting to expand each event into multiple (based on the count setting of makeresults: change this to the number of days you want to show):

| join max=0 [| makeresults count=7 | streamstats count | eval time = relative_time(_time - (count - 1) * 68400, "@d") | fields time]

The subsearch on its own just creates a number of rows with days as epoch values in the field time. Now all we need to do is calculate the duration per day that a site is down:

| eval outage_secs_from = if(eventtime <= (time + 86400), min((time + 86400) - eventtime, 86400), 0)
| eval outage_secs_to = if(clrtime > time, min(clrtime - time, 86400), 0)
| eval outage_secs_total = min(outage_secs_from, outage_secs_to)
| fieldformat outage_total_duration = tostring(outage_secs_total, "duration")
| eval date = strftime(time, "%F")

This is the data you can now work with, here you'll see what will become your chart in a minute (hence the fieldformat outage_total_duration for readability). The next step transforms your data (similar to the xyseries in your example):

| chart sum(outage_secs_total) by sitename date

This would give you a sum of the outage seconds per sitename and day (I've used strftime with %F here, but you can also use any other string). I see you used some sort of count and mean by date and sitename in your exaple, right before the chart would be the place to do all kinds of calculations. If you wanted to have site availability as a percentage of seconds in a day, you would change those last lines to

| eval outage_secs_total = min(outage_secs_from, outage_secs_to)
| eval availability = round(100 * (86400 - outage_secs_total) / 86400, 2)
| eval date = strftime(time, "%F")
| chart avg(availability) by sitename date

I hope this is going in the right direction. Feel free to come back with any follow up questions.

View solution in original post

Champion

Based on your comments, its sufficient if the report covers downtime events only after they are cleared, meaning this won't show you downtimes that have begun in the past and are not cleared yet. Since you're using fields in the search provided above that I don't know about and I don't exactly understand what you're doing there, I'm just going to calculate things from what I know and try to come up with what I think you mean. I've used this to mock some data:

| makeresults | eval clrtime = 1554768400, eventtime = 1554761200, sitename="foo"
| append [| makeresults | eval clrtime = 1554867600, eventtime = 1554761100, sitename="bar"]
| append [| makeresults | eval clrtime = 1554768400, eventtime = 1554636473, sitename="baz"]

From here, I've used join with the max=0 setting to expand each event into multiple (based on the count setting of makeresults: change this to the number of days you want to show):

| join max=0 [| makeresults count=7 | streamstats count | eval time = relative_time(_time - (count - 1) * 68400, "@d") | fields time]

The subsearch on its own just creates a number of rows with days as epoch values in the field time. Now all we need to do is calculate the duration per day that a site is down:

| eval outage_secs_from = if(eventtime <= (time + 86400), min((time + 86400) - eventtime, 86400), 0)
| eval outage_secs_to = if(clrtime > time, min(clrtime - time, 86400), 0)
| eval outage_secs_total = min(outage_secs_from, outage_secs_to)
| fieldformat outage_total_duration = tostring(outage_secs_total, "duration")
| eval date = strftime(time, "%F")

This is the data you can now work with, here you'll see what will become your chart in a minute (hence the fieldformat outage_total_duration for readability). The next step transforms your data (similar to the xyseries in your example):

| chart sum(outage_secs_total) by sitename date

This would give you a sum of the outage seconds per sitename and day (I've used strftime with %F here, but you can also use any other string). I see you used some sort of count and mean by date and sitename in your exaple, right before the chart would be the place to do all kinds of calculations. If you wanted to have site availability as a percentage of seconds in a day, you would change those last lines to

| eval outage_secs_total = min(outage_secs_from, outage_secs_to)
| eval availability = round(100 * (86400 - outage_secs_total) / 86400, 2)
| eval date = strftime(time, "%F")
| chart avg(availability) by sitename date

I hope this is going in the right direction. Feel free to come back with any follow up questions.

View solution in original post

Explorer

With this approach it's possible to expand the event only by the fixed number of new events, 7 in your case, defined by the count.
In my case I never know how long is the event, count must be variable calculated per each event. Can this be fixed?

0 Karma

Champion

Sure that's possible, we're just going to have to change how we create our events per day. Replace the join command with these three lines:

| eventstats max(clrtime) as max_clrtime min(eventtime) as min_eventtime
| eval time = mvrange(relative_time(min_eventtime, "@d"), relative_time(max_clrtime, "@d+1d"), "1d")
| mvexpand time

This creates a multivalue field with mvrange, specifically its ability to create date ranges. It excludes the end value, which is why we're adding an additional day to that parameter. The dates for this mvrange are your earliest and latest eventtime and clrtime values.
Be aware that this dynamic solution will give you columns for as many days as there are in your data, and that chart will create an OTHER column when it gets more than 10 values. Use chart limit=20 or some other value to increase this number.

0 Karma

Explorer

Thanks,
but the above does not create new events, if I'm not missing something.

0 Karma

Champion

It does, mvexpand creates new events from multivalue fields. Maybe I was unclear in what to replace:

| makeresults | eval clrtime = 1554768400, eventtime = 1554761200, sitename="foo"
| append [| makeresults | eval clrtime = 1554867600, eventtime = 1554761100, sitename="bar"]
| append [| makeresults | eval clrtime = 1554768400, eventtime = 1554636473, sitename="baz"]
| eventstats max(clrtime) as max_clrtime min(eventtime) as min_eventtime
| eval time = mvrange(relative_time(min_eventtime, "@d"), relative_time(max_clrtime, "@d+1d"), "1d")
| mvexpand time
| eval outage_secs_from = if(eventtime <= (time + 86400), min((time + 86400) - eventtime, 86400), 0)
| eval outage_secs_to = if(clrtime > time, min(clrtime - time, 86400), 0)
| eval outage_secs_total = min(outage_secs_from, outage_secs_to)
| eval date = strftime(time, "%F")
| chart sum(outage_secs_total) by sitename date

Explorer

jeffland,
I appreciate your help very much.
I had to add a single correction to your code, the events that are started and cleared on the same day were not calculated right. The following has fixed this problem.
| eval outage_secs_between = clrtime-eventtime
| eval outage_secs_total = min(outage_secs_from, outage_secs_to, outage_secs_between)

Regards!

0 Karma

Champion

Oh, yes you're right - I forgot that case. Happy to help!

0 Karma

Champion

When are your events created, at CLRTIME? I assume EVENTTIME and CLRTIME are timestamps?

0 Karma

Explorer

eventtime - event creation time, clrtime - event clearance time. For the particular site down event, I actually have two messages. First, when the site goes down, there's an event with a specific sitename and eventtime. clrtime is empty at this moment. When the site comes back online, there's another message, with the sitename and eventtime same as in the previous message and the crltime now is not empty and is he time site came back online.

0 Karma

Champion

And is there a common id in the two events? Or can we otherwise assume that if a combination of site name and eventtime is found only twice, once with an empty clrtime field and once with that field filled, that the two events refer to the same site down event?

0 Karma

Explorer

It would be easier to use a combination of the sitename and eventtime. There is some ID but it's not plain and needs a lot of regexing. If you think there's no other way I'll work on it. Actually we can use only the event with both eventtime and clrtime present and ignore the event with the empty clrtime field. I do not need all that running realtime, it's ok to have report when the event with the clrtime arrives.

So when it arrives, we'll have eventtime (start) and clrtime (event end).

Thanks,
NJ

0 Karma

Builder

Evaluate how you are calculating eventtime_u. You may need to adjust for that to have a globally unique event id that occurs at both the start and the stop of the event. Perhaps the transaction command (https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Transaction).

###

If this reply helps you, an upvote would be appreciated.
0 Karma

Explorer

eventime_u=clrtime-eventtime; it's set through the "calculated fields" from the Web GUI.
I may be able to extract the unique ID it's there in the raw data. It could give me a possibility to fix on the ceasing day. But it's still unclear what could be done on the day in the middle of the event.
I was thinking of makeresults command, but not sure.

0 Karma

Esteemed Legend

Give us your existing search. How are we supposed to help with only a description?

0 Karma

Explorer

I was looking for a hint or an idea. Here are all the details, anyways.

| search  (SPECPB="*LOSS-OF-ALL-CHAN*")
| dedup EVTTIME SPECPB

| eval bintime_u=eventtime_u
| bin bintime_u span=1d

| stats list(sectorname) as sectorname list(eventtime) as eventtime list(clrtime) as clrtime  list(eventduration_u) as eventduration_u sum(eventduration_u) as cumulative_sector_down   count(eventduration_u) as cumulative_event_count by bintime_u sitename 

| eval bintime=strftime(bintime_u,"%m-%d")

| eval mean_site_down=cumulative_sector_down/cumulative_event_count
| eval siteavailability_d=100-round(100*mean_site_down/86400,2)

| xyseries sitename,bintime,siteavailability_d

My events:

_time   sitename   sectorname   eventtime   clrtime  SPECPB  eventduration_u

So I'm getting correct results when the event lasts within a day (i.e. it starts after 00:00 and ends before 24:00 of the same day). If the event lasts longer then one day, starting on the day before and ending on the next day, the current day has obviously no event. I need to spread such events trough the following bins somehow.

Regards,
NJ

0 Karma