Splunk Search

Calculate duration of downtime if three or more consecutive events are marked down within an hour...

hansj
Explorer

Hi,

I am trying to make a service downtime calculation based on the following rule: if the service has the status 'error' for three or more consecutive events within an hour, measure the duration from the first status='error' to the last status='error' (before the next 'ok').

I have pasted a log further down, and here's what I want the results to become:
Snippet 1: The downtime should be 1875 seconds
Snippet 2: 600 seconds
Snippet 3, part 1: No downtime because there are only 2 status='error' events
Snippet 3, part 2: 3 seconds
Snippet 4: 3000 seconds
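
To make the rule concrete outside Splunk, here's a quick Python sketch of what I mean (my own made-up names, and it ignores the one-hour window for brevity): each run of three or more consecutive errors is measured from its first to its last error, excluding the closing ok.

```python
from datetime import datetime

def downtimes(events, min_errors=3):
    # Group consecutive "error" events; each run of min_errors or more
    # counts from its first to its last error (the next "ok" is excluded).
    runs, current = [], []
    for ts, status in events:
        if status == "error":
            current.append(datetime.fromisoformat(ts))
        else:
            if len(current) >= min_errors:
                runs.append(int((current[-1] - current[0]).total_seconds()))
            current = []
    if len(current) >= min_errors:
        runs.append(int((current[-1] - current[0]).total_seconds()))
    return runs

snippet1 = [("2014-09-08T00:14:00", "ok"),
            ("2014-09-08T00:30:00", "error"),
            ("2014-09-08T01:01:00", "error"),
            ("2014-09-08T01:01:05", "error"),
            ("2014-09-08T01:01:10", "error"),
            ("2014-09-08T01:01:15", "error"),
            ("2014-09-08T01:10:00", "ok")]
print(downtimes(snippet1))  # [1875]
```

Under this reading, snippet 1 comes out at 31m15s = 1875 seconds.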

I've tried the transaction and streamstats commands to do what I want, but so far without success.

The following transaction commands nearly work for grouping the events together:

... | eval down=if(status=="error",1,0) | transaction down maxspan=1h maxpause=5m

or

... | eval down=if(status=="error",1,0) | transaction startswith="down=1" endswith="down=0"

The first transaction example includes the maxpause parameter, which I don't really want. But then again, leaving it out doesn't work either.
The second transaction example includes the first ok event after the last error, which is also something I don't want.

It would be great if the transaction command accepted a parameter like 'ends_before=down=0' rather than endswith...

FYI, I have also touched upon the streamstats command to filter out similar statuses like so:

| streamstats window=1 current=f last(down) as previous_down by service | where NOT down=previous_down

If I could get the transaction command to work, though, I could use the linecount field to exclude errors that are fewer than 3.
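
In plain Python (made-up names), that streamstats filter amounts to keeping only the events where the status changes from the previous event:

```python
def status_changes(events):
    # Keep only events whose status differs from the previous event's,
    # like streamstats window=1 current=f + where NOT down=previous_down.
    kept, prev = [], None
    for ts, status in events:
        if status != prev:
            kept.append((ts, status))
        prev = status
    return kept

sample = [("t1", "ok"), ("t2", "error"), ("t3", "error"), ("t4", "ok")]
print(status_changes(sample))  # [('t1', 'ok'), ('t2', 'error'), ('t4', 'ok')]
```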

Does anybody have a solution or pointers in the right direction?

Thanks,
Hans J.

==== 8< ==== Log example =====
... Snippet 1
2014-09-08T00:14:00,ok
2014-09-08T00:30:00,error
2014-09-08T01:01:00,error
2014-09-08T01:01:05,error
2014-09-08T01:01:10,error
2014-09-08T01:01:15,error
2014-09-08T01:10:00,ok
... Snippet 2
2014-09-08T02:00:35,ok
2014-09-08T02:55:00,error
2014-09-08T03:02:00,error
2014-09-08T03:05:00,error
2014-09-08T03:10:01,ok
... Snippet 3
2014-09-08T03:10:05,ok
2014-09-08T03:10:06,error
2014-09-08T03:10:07,error
2014-09-08T03:20:00,ok
2014-09-08T03:20:01,error
2014-09-08T03:20:02,error
2014-09-08T03:20:03,error
2014-09-08T03:20:04,ok
... Snippet 4
2014-09-08T03:20:09,ok
2014-09-08T04:00:00,error
2014-09-08T04:00:00,error
2014-09-08T04:05:00,error
2014-09-08T04:06:00,error
2014-09-08T04:07:00,error
2014-09-08T04:08:00,error
2014-09-08T04:09:00,error
2014-09-08T04:30:00,error
2014-09-08T04:35:00,error
2014-09-08T04:45:00,error
2014-09-08T04:50:00,error
2014-09-08T05:10:00,ok
...


musskopf
Builder

Hey hansj, if you want to calculate the downtime, you should include the "ok" event to get the real duration... it's very clear in your snippet 3, part 2:

2014-09-08T03:20:01,error   <- duration = 0s
2014-09-08T03:20:02,error   <- duration = 1s
2014-09-08T03:20:03,error   <- duration = 2s
2014-09-08T03:20:04,ok      <- duration = 3s

But if you still want to ignore the duration from the last error to its recovery, excluding the "ok" event, you could try something like:

... | reverse | streamstats current=f window=1 last(status) AS status_after by serviceName | eval is_last_error=if(status_after=="ok",1,0) | reverse | transaction serviceName startswith="status=error" endswith="is_last_error=1" | where eventcount >= 3

I haven't tried it, but it should be enough for you to identify the last error before an "ok" for the same service... In the example you need to reverse once to "see into the future" and then reverse back to work on the transactions. If you're not getting the events in chronological order, you could replace the first "reverse" with "sort - _time", but that adds some load to your search.
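
If it helps, the look-ahead idea in plain Python (hypothetical names, a single service) is just: for each event, peek at the next one, and flag an "error" that is immediately followed by an "ok" as the last error of its run.

```python
def flag_last_errors(events):
    # Look one event ahead: an "error" whose next event is "ok"
    # is the last error of its run.
    flagged = []
    for i, (ts, status) in enumerate(events):
        nxt = events[i + 1][1] if i + 1 < len(events) else None
        flagged.append((ts, status, status == "error" and nxt == "ok"))
    return flagged

events = [("03:20:01", "error"), ("03:20:02", "error"),
          ("03:20:03", "error"), ("03:20:04", "ok")]
flags = flag_last_errors(events)
print(flags[2])  # ('03:20:03', 'error', True)
```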

Hope it helps, cheers!


hansj
Explorer

Thank you again, I will certainly try this!

Cheers!


hansj
Explorer

Thanks, yes the where eventcount > 3 helps, but I would want the transaction to end with the last error before the ok. Another way to think of it is that I want to group all the error chunks together, i.e. the command I want looks something like: ... | transaction startswith="status=error" ends_before="status=ok", so for snippet 1 I would get a duration of 31m15s.


musskopf
Builder

Hey Hansj, just a non-Splunk related comment/2cents:

... Snippet 1
2014-09-08T00:14:00,ok     <- duration = 00:00:00
"unknown state"
2014-09-08T00:30:00,error  <- duration = 00:00:00
2014-09-08T01:01:00,error  <- duration = 00:31:00
2014-09-08T01:01:05,error  <- duration = 00:31:05
2014-09-08T01:01:10,error  <- duration = 00:31:10
2014-09-08T01:01:15,error  <- duration = 00:31:15 (how can we assume the downtime finished here?)
"unknown state"
2014-09-08T01:10:00,ok     <- duration = 00:40:00 (I would assume the downtime finished here)

I don't know where you're collecting this data from. If it's from a monitoring system that tests the service on a regular basis, I would definitely count the whole duration until the "ok" as downtime... Now, if you're watching an application log and those errors are generated by, let's say, random human interactions, then the time between the last "error" and the following "ok" is unknown. But the same situation applies from the last "ok" to the first "error", a period you're already ignoring; so again, I would count the whole duration until the "ok" as downtime.

Anyway, I've posted an answer below which might help...

Cheers


hansj
Explorer

Thanks for helping out, markthompson and musskopf.

The system is not regularly tested; activity comes mostly from random human interactions plus some scheduled scripts run at different times. As such it's quite possible there's an "error" at 00:30 and an "ok" at 05:00, and in that case I don't think it's correct to count 4h 30m as downtime; it is, like you say, an unknown state. But then again, I also get your point that the same situation applies from the last "ok" to the first "error", which I guess would even the whole thing out in the long run.

Thanks for your 2 cents, and cheers (I've heard so many Australians say this; it's nice, so I think I'm adopting it from now on).


markthompson
Builder

You need to work out the possible eventualities your chunk could end with, and then use something like:
... | transaction startswith="status=error" endswith="<eventuality1> OR <eventuality2>"
listing all of the different variations. Hope this helps.


musskopf
Builder

Hello,

I would stick with something like that:

... | transaction serviceName startswith="status=error" endswith="status=ok" | where eventcount > 3 | eval duration=if(closed_txn==0, "still down, or do some time calculation like now()-_time", duration) | table _time,serviceName,duration

You basically use "eventcount" to keep only the transactions with more than 3 events, i.e. at least 3 errors plus the closing "ok". Does it help?
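
To illustrate the counting outside Splunk (plain Python, hypothetical names): a transaction that starts at the first error and ends at the next ok includes the closing ok, so a run of 3 errors gives an eventcount of 4. A rough sketch:

```python
from datetime import datetime

def error_to_ok_transactions(events, min_events=4):
    # Group from the first "error" up to and including the next "ok";
    # keep groups with min_events or more events (the errors plus the ok).
    txns, current = [], []
    for ts, status in events:
        if status == "error":
            current.append((ts, status))
        elif current:
            current.append((ts, status))
            if len(current) >= min_events:
                start = datetime.fromisoformat(current[0][0])
                end = datetime.fromisoformat(current[-1][0])
                txns.append(int((end - start).total_seconds()))
            current = []
    return txns

snippet2 = [("2014-09-08T02:00:35", "ok"),
            ("2014-09-08T02:55:00", "error"),
            ("2014-09-08T03:02:00", "error"),
            ("2014-09-08T03:05:00", "error"),
            ("2014-09-08T03:10:01", "ok")]
print(error_to_ok_transactions(snippet2))  # [901]
```

Note the duration here runs all the way to the "ok" (901 seconds for snippet 2), which is the difference being discussed above.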
