Hello there,
I am trying to figure out the best way to do the following.
We run Nagios, and some of our Nagios alerts are communicated via a Jabber bot. I would like to know how long it takes for an alert to be acknowledged, measured from the time it first fires to the time an analyst acknowledges it. Some example lines from the logs:
Thu Mar 31 19:57:57 2011: listenLoop: raw line: PROBLEM*|*rg-da-dysonrep3*|*Check disk RG - data_dysonrep3*|*CRITICAL*|*1*|*DISK CRITICAL - free space: /foo/foobar/data_dysonrep3 4 MB...
Thu Mar 31 20:02:57 2011: listenLoop: raw line: PROBLEM*|*rg-da-dysonrep3*|*Check disk RG - data_dysonrep3*|*CRITICAL*|*2*|*DISK CRITICAL - free space: /foo/foobar/data_dysonrep3 4 MB...
Thu Mar 31 20:08:53 2011: listenLoop: raw line: ACKNOWLEDGEMENT*|*rg-da-dysonrep3*|*Check disk RG - data_dysonrep3*|*CRITICAL*|*2*|*ajohnson: 174782
These lines can be mixed in with other lines in the log, and they aren't always in perfect order like this. What I need is to use Splunk to find out how long it took an analyst to acknowledge the alert after it first fired. In this case it took about 11 minutes (19:57:57 to 20:08:53): the alert fired twice and was then acknowledged.
I've been researching different commands, and I'm not sure whether "diff" is the one I'm looking for. I'm still trying to figure out how to get Splunk to match every alert that fires with its acknowledgement, if there is one at all.
Any ideas?
I disagree; transaction is very resource-intensive, and I would avoid it whenever possible. Try this:
my_alert_events | stats earliest(_time) AS startTime, latest(_time) AS endTime by alert_id | eval responseTime=endTime-startTime
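One caveat: earliest and latest by alert_id measure the span from the first event to the last event for that ID, whatever type those events are. A minimal sketch that restricts the measurement to the first PROBLEM and the first ACKNOWLEDGEMENT is below. It assumes a hypothetical extracted field alert_type holding the first token after "raw line:" (PROBLEM, ACKNOWLEDGEMENT, and so on), which is not something the original search defines.

```
my_alert_events
| stats min(eval(if(alert_type="PROBLEM", _time, null()))) AS firstProblem,
        min(eval(if(alert_type="ACKNOWLEDGEMENT", _time, null()))) AS firstAck
        by alert_id
| eval responseTime = firstAck - firstProblem
| where isnotnull(responseTime)
```

Because this groups purely by alert_id, it treats everything in the search window as one episode per alert. If the same alert can fire and be acknowledged several times, narrow the time range or add something to the by clause that separates the episodes.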
gnovak,
I would recommend using the "transaction" command for this. You would need to extract a field (e.g. alert_id) that uniquely identifies your alerts (e.g. rg-da-dysonrep3). Then you would run your search like so:
<my alert events> | transaction alert_id startswith=PROBLEM endswith=ACKNOWLEDGEMENT | stats max(duration) by alert_id
Ultimately, "transaction" gives you a "duration" field that you can do a number of things with, such as calculating the average duration by analyst.
See also: http://www.splunk.com/base/Documentation/latest/SearchReference/Transaction
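For the sample lines in the question, the field extraction could be sketched inline with rex. The field names alert_type, host_name and service_desc are made up for illustration, the regex assumes the *|* delimiter seen in the samples, and building alert_id from host plus service is an assumption about what makes an alert unique in your environment.

```
<my alert events>
| rex "raw line: (?<alert_type>[A-Z]+)\*\|\*(?<host_name>[^*]+)\*\|\*(?<service_desc>[^*]+)\*\|\*"
| eval alert_id = host_name . ":" . service_desc
| transaction alert_id startswith="PROBLEM" endswith="ACKNOWLEDGEMENT"
| table alert_id duration eventcount
```

When an alert fires more than once before it is acknowledged, check the resulting duration against a known case such as the sample above (roughly 11 minutes from the first PROBLEM). Depending on how transaction treats repeated startswith matches, the transaction may begin at the most recent PROBLEM rather than the first one.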
So far this seems promising:
sourcetype="jabber_nagios" NOT RECOVERY listenLoop | transaction startswith=PROBLEM endswith=ACKNOWLEDGEMENT
However, when I extract the alert and hostname and add them to the search, the results look strange. I'll keep trying.
Also, with transaction you can use "keepevicted=true" to include transactions that never "closed" with an endswith= line. Transaction then adds a closed_txn field, set to 0 for those evicted transactions, which you can search on to see which alerts were not acknowledged at all.
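As a sketch, building on the search posted above, listing the alerts that never got an acknowledgement might look like this. Only keepevicted=true and the closed_txn filter are new; everything else is taken from that search unchanged.

```
sourcetype="jabber_nagios" NOT RECOVERY listenLoop
| transaction startswith=PROBLEM endswith=ACKNOWLEDGEMENT keepevicted=true
| search closed_txn=0
```

It is worth verifying the result against an alert you know was never acknowledged, since which transactions count as evicted depends on the other transaction options in use.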