Alerting

Use Splunk to track response time

gnovak
Builder

Hello there,

I am trying to figure out the best way to do the following task.

We run Nagios, and some of our Nagios alerts are communicated via a Jabber bot. I would like to know how long it takes for an alert to be acknowledged, measured from the time it first fires to the time an analyst acknowledges it. Here are some example lines from the logs:

Thu Mar 31 19:57:57 2011: listenLoop: raw line: PROBLEM*|*rg-da-dysonrep3*|*Check disk RG - data_dysonrep3*|*CRITICAL*|*1*|*DISK CRITICAL - free space: /foo/foobar/data_dysonrep3 4 MB...
Thu Mar 31 20:02:57 2011: listenLoop: raw line: PROBLEM*|*rg-da-dysonrep3*|*Check disk RG - data_dysonrep3*|*CRITICAL*|*2*|*DISK CRITICAL - free space: /foo/foobar/data_dysonrep3 4 MB...
Thu Mar 31 20:08:53 2011: listenLoop: raw line: ACKNOWLEDGEMENT*|*rg-da-dysonrep3*|*Check disk RG - data_dysonrep3*|*CRITICAL*|*2*|*ajohnson: 174782

These lines may be mixed in with other lines in the log, and they aren't always in perfect order like this. What I need is to use Splunk to find out how long it took an analyst to acknowledge the alert after it first fired. In this case it took about 11 minutes (19:57:57 to 20:08:53): the alert fired twice and was then acknowledged.

I've been researching different commands, and I'm not sure whether "diff" is the one I'm looking for. I'm still trying to figure out how to get Splunk to match every alert that fires with its acknowledgement, if there is one at all.

Any ideas?


woodcock
Esteemed Legend

I disagree; transaction is very expensive in resources, and I would avoid it whenever possible. Try this:

my_alert_events | stats earliest(_time) AS startTime, latest(_time) AS endTime by alert_id | eval responseTime=endTime-startTime
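
The stats search above assumes an alert_id field already exists and treats the latest event per alert as the end point. Here is a minimal sketch of one way to fill in both gaps. It assumes the sourcetype is jabber_nagios (taken from the asker's follow-up) and that the Nagios host plus the service name is enough to identify an alert. The field names alert_type, nagios_host, nagios_service, firstProblem, and firstAck are made up for illustration and are not part of the original data.

```
sourcetype="jabber_nagios" listenLoop "raw line:"
| rex "raw line: (?<alert_type>[A-Z]+)\*\|\*(?<nagios_host>[^*]+)\*\|\*(?<nagios_service>[^*]+)\*\|\*"
| search alert_type="PROBLEM" OR alert_type="ACKNOWLEDGEMENT"
| eval alert_id=nagios_host."|".nagios_service
| stats min(eval(if(alert_type="PROBLEM", _time, null()))) AS firstProblem
        min(eval(if(alert_type="ACKNOWLEDGEMENT", _time, null()))) AS firstAck
        by alert_id
| eval responseTime=firstAck-firstProblem
```

responseTime is in seconds and stays empty for alerts with no acknowledgement in the search window. For the three sample lines in the question, it would be 20:08:53 minus 19:57:57, or 656 seconds, provided Splunk takes _time from the timestamp at the start of each line. Note that this sketch lumps every occurrence of the same host and service into one row, so a recurring alert would need a narrower time range or an extra grouping key.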

hazekamp
Builder

gnovak,

I would recommend using the "transaction" command for this. You would need to extract a field (e.g. alert_id) that uniquely identifies your alerts (e.g. rg-da-dysonrep3). Then you would perform your search like so:

<my alert events> | transaction alert_id startswith=PROBLEM endswith=ACKNOWLEDGEMENT | stats max(duration) by alert_id

Ultimately "transaction" will give you a "duration" field that you can do a number of things with...like calculate average duration by analyst.

See also: http://www.splunk.com/base/Documentation/latest/SearchReference/Transaction
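
As a rough sketch of the "average duration by analyst" idea: this assumes the analyst name is the text before the first colon in the last field of the ACKNOWLEDGEMENT line (as in "ajohnson: 174782"). The field names alert_type, nagios_host, nagios_service, detail, and analyst are illustrative and are not part of the original data.

```
sourcetype="jabber_nagios" listenLoop "raw line:"
| rex "raw line: (?<alert_type>[A-Z]+)\*\|\*(?<nagios_host>[^*]+)\*\|\*(?<nagios_service>[^*]+)\*\|\*[^*]+\*\|\*\d+\*\|\*(?<detail>.*)"
| search alert_type="PROBLEM" OR alert_type="ACKNOWLEDGEMENT"
| eval analyst=if(alert_type="ACKNOWLEDGEMENT", mvindex(split(detail, ":"), 0), null())
| transaction nagios_host nagios_service startswith=eval(alert_type="PROBLEM") endswith=eval(alert_type="ACKNOWLEDGEMENT")
| stats avg(duration) AS avgResponseSeconds, count AS acknowledgedAlerts by analyst
```

Because every PROBLEM line matches startswith, a repeated alert may open a new transaction, so duration can end up measured from the last PROBLEM before the acknowledgement rather than from the first. It is worth checking the result against a known case such as the sample lines in the question.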

gnovak
Builder

After trying this, the following seems promising so far: sourcetype="jabber_nagios" NOT RECOVERY listenLoop | transaction startswith=PROBLEM endswith=ACKNOWLEDGEMENT
However, when I extract the alert and hostname and add them to the search, the results look strange. I'll keep trying.


dwaddle
SplunkTrust

Also, with transaction you can use "keepevicted=true" to tell transaction to include transactions that didn't "close" with an endswith= line. That adds a binary field, closed_txn, which you can search on to see which alerts were never acknowledged.
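
A minimal sketch of that idea, reusing the asker's sourcetype and assuming illustrative extracted fields named alert_type, nagios_host, and nagios_service (hypothetical names, not from the original data):

```
sourcetype="jabber_nagios" listenLoop "raw line:"
| rex "raw line: (?<alert_type>[A-Z]+)\*\|\*(?<nagios_host>[^*]+)\*\|\*(?<nagios_service>[^*]+)\*\|\*"
| search alert_type="PROBLEM" OR alert_type="ACKNOWLEDGEMENT"
| transaction nagios_host nagios_service startswith=eval(alert_type="PROBLEM") endswith=eval(alert_type="ACKNOWLEDGEMENT") keepevicted=true
| search closed_txn=0
| table _time nagios_host nagios_service eventcount
```

Repeated PROBLEM lines for an alert that was eventually acknowledged may also show up here as evicted transactions, so the list is a starting point for review rather than a final answer.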
