Alerting

As part of a Splunk alert, is it possible to include 100 lines from the log prior to the event that triggered the alert?

Explorer

I want to set up a Splunk Alert when an ERROR string is seen. As part of the Alert, I want to send 100 lines in the log prior to the event. Is that doable?

1 Solution

SplunkTrust

I can't tell if this is brilliant or if it's the worst level of horribad, but I found something that may work. I apologize in advance if this makes your server explode or the universe collapse in on itself due to its inefficiency. Still, I think it works, so give it a try!

I don't have your search or your data, so I had to mock up an answer on my own data, which is very much not like yours. Anyway, here's what I came up with.

Please ignore my root search - it's my firewall logs and I rex up a field I found that appears in about 1.5% of the events just so I had something to work with. That's the first line. I include it so you can see where my answer is coming from.

host=<myhost>| rex "^[^>\n]*>(?P<bad_cksum>[^\)]+)" 
| streamstats window=100 last(bad_cksum) as trigger_error
| transaction trigger_error

The streamstats is where the magic happens. streamstats keeps a 100-event window as events stream through it. When it finds bad_cksum defined in the last (most recent) event of any particular group of 100, it copies that bad_cksum value to the other 99 events as the field "trigger_error".

If you want to take a look at the data before the transaction to see what the streamstats does, drop off the transaction command and put in a table blah blah bad_cksum trigger_error and scroll through the results for a while until you find a section where it "did its thing."
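For example, a debug version of that search (a sketch only; the rex and field names come from my firewall example above, so swap in your own) could be:

host=<myhost>| rex "^[^>\n]*>(?P<bad_cksum>[^\)]+)" 
| streamstats window=100 last(bad_cksum) as trigger_error
| table _time _raw bad_cksum trigger_error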

Once you have this new field which is mostly non-existent except for when the error occurs and for the 99 preceding events, we just build a transaction out of that.

I have tested this and it appears to work. Because of how I'm using it, bad_cksum must be a field that is defined only on the events where the problem actually occurs. In your case you may want an eval that creates a concatenated field of several key/unique items from the event whenever it sees the word ERROR. I'll handwave that away for the moment, hoping you can figure something out. If not, we'll definitely need some sample data and sample searches to work from.
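As a rough sketch of such an eval (the field name trigger_key is made up for illustration; searchmatch() is the eval function that tests an event against a search string):

... | eval trigger_key=if(searchmatch("ERROR"), _time."-".host."-ERROR", null())
| streamstats window=100 last(trigger_key) as trigger_error
| transaction trigger_error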

Super-helpful hint: you may want to test it on smallish data sets first - find a few-hour time frame, a few minutes, or whatever, where you only have a few errors to test with. Scale up from there.

And lastly, I loosely based this on an answer I came up with for another problem a while back. That Answers post has another take on streamstats that may be useful for understanding what it's doing.


Explorer

Unfortunately this doesn't work for me.
Here's the example log data
09-15-2015 10:00:00 INFO blah1
09-15-2015 10:00:01 INFO blah2
09-15-2015 10:00:02 INFO blah3
09-15-2015 10:00:03 WARN blah4
09-15-2015 10:00:04 INFO blah5
09-15-2015 10:00:04 ERROR blah6

index=myapp host="myhost" ERROR | streamstats window=10 last(ERROR) as triggererror| transaction triggererror

Doesn't yield anything 😞


SplunkTrust

That's exactly the point I was making yesterday: if your base search filters for ERROR, you have already lost the _raw events for INFO and WARN, and streamstats or autoregress will not be able to find anything besides events containing ERROR. Use this as the base search and you should be able to find something:

index=myapp host="myhost" ERROR OR INFO OR WARN | streamstats window=10 last(ERROR) as trigger_error| transaction trigger_error

Still untested, so don't blame me if it doesn't work right out of the box 🙂

SplunkTrust

Yes, I forgot to call out the fact that your root search has to be fairly broad. If you limit it to only your ERROR lines, well, then, they're gone and it's very hard to include them afterwards.

MuS is on the right track in using the streamstats to provide your new field, but in your specific case this might have to be taken one step further and we may need to create a unique field to operate on so transaction doesn't get goofy upon overlap or non-beautiful data. Most data isn't beautiful. It's a shame, really. Anyway.

We can use eval to do this. I think this is how you'd do it. I probably messed up the rex, and I'm sure it can be improved on.

index=myapp host=myhost ERROR OR INFO OR WARN 
| rex "^\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} (?<level>\w+) (?<message>.*)"
| eval my_working_field=if(level=="ERROR",_time."-ERROR",null())
| streamstats window=100 last(my_working_field) as trigger_error
| transaction trigger_error

The rex creates new fields out of your "level" and "message" (level being ERROR, WARN, INFO in your example and message being everything else in the line). I do this because it seems more efficient when I have to eval against that in the next line. If you already have this as a field, great - use that. Otherwise, this will do.

The eval asks, "is level equal to ERROR?" If so, it creates a new field "my_working_field" as a concatenation of the _time field and the string "-ERROR" (like "1442777888.591122-ERROR"), which will be unique. If the level isn't ERROR, it doesn't create the field AT ALL (that's the "null()" part). This is important.

Then the last two commands, streamstats and transaction, are largely unchanged.

This should work, though obviously it may need minor tweaking. And, I'm not convinced it'll behave perfectly if you have ERRORs happen closer than 100 events from each other. It might, and it might not. I just don't know, but I'd love to hear back!
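To sanity-check the grouping before wiring this into an alert, the same search can end in a table instead (an untested sketch; eventcount is a field that transaction adds, showing how many events each group swept up):

index=myapp host=myhost ERROR OR INFO OR WARN 
| rex "^\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} (?<level>\w+) (?<message>.*)"
| eval my_working_field=if(level=="ERROR",_time."-ERROR",null())
| streamstats window=100 last(my_working_field) as trigger_error
| transaction trigger_error
| table _time eventcount trigger_error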

Explorer

Thanks, I missed this. I will try it today and let you know. Thanks a lot for coming up with a brilliant solution in this area.


Community Manager

Hi @cybrainx

Was wondering if you'd be able to update whether or not @rich7177's solution worked for you. If yes, please accept it! If no, please update with more details. Thanks!


Communicator

This is pretty great. Saving this off. Thanks


SplunkTrust

Nice one - I'll keep it on my favorites list!


SplunkTrust

How many error entries will there be in the alert? I can suggest a workaround, but I need to know the count and the current alert search to see if it's even feasible with that workaround.


Esteemed Legend

You could use autoregress like this (tested; it works):

<YOUR BASE SEARCH HERE> | eval raw=_raw | autoregress raw p=1-100 | table raw* | <YOUR SEARCH CONDITIONS HERE>

http://docs.splunk.com/Documentation/Splunk/6.2.6/SearchReference/Autoregress
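Applied to the ERROR case from the question, the same pattern might look like this (an untested sketch; the index and host values are placeholders):

index=myapp host="myhost" | eval raw=_raw | autoregress raw p=1-100 | search ERROR | table _time raw*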


Explorer

Thanks for your response; autoregress did not work. This seems to be a basic requirement, so I am sure there is a way.
I tried "tail 100", but that shows the last 100 times ERROR occurred. I want all the lines prior, which hold logs that may contain a clue to the cause of the error. This would help us react to issues faster.


SplunkTrust

As I said, the problem is that after any filter is applied to the base search, the _raw events preceding the matching event are gone. Wait for @somesoni2 and his solution.


Esteemed Legend

I am not saying it is the best solution (certainly the one by @rich7177 is better) but this will work. Copying his example, it would work like this:

host=<myhost> | rex "^[^>\n]*>(?P<bad_cksum>[^\)]+)"  | eval raw=_raw | autoregress raw p=1-100 | where isnotnull(bad_cksum) | table raw*

This approach will cause considerable RAM use (every event searched will expand 100-fold), but it does work.
I do get your point, though; we cannot do the autoregress after the discrimination, because then we will not have the adjacent events. That means we horribly bloat the RAM to do the search (much like transaction does).


SplunkTrust

This will not and cannot work; as soon as you hit the alert condition, _raw is only available for the event that matched the condition.


Esteemed Legend

Did you test it? I admit that I did not, so my original answer was a little off, but it does indeed work (I just tested it); do it like this:

<YOUR BASE SEARCH HERE> | eval raw=_raw | autoregress raw p=1-100 | table raw* | <YOUR SEARCH CONDITIONS HERE>

In this case, a single event is returned but it has 100 fields (raw_p1, raw_p2, ..., raw_p100).
Original answer updated.


SplunkTrust

Yes, I did test it. It may work this way, but as soon as you have a filter on the base search (which is what an alert is), it does not work.

Try this:

index=_internal kbps>=10 | eval raw=_raw | autoregress raw p=1-100 | table raw*

All the previous events are events which hit the alert condition as well.


Esteemed Legend

Exactly, it works just the way I would expect. In what way does it not work for you? Obviously the fields passed to the table command will need to be adjusted so that the necessary filter conditions can be put inside the search (i.e. in my ``) instead of in the alert configuration, but that is a trivial hurdle.


SplunkTrust

The question was "I want to send 100 lines in the log prior to the event," which is no longer possible after the filter on the base search, because all the remaining _raw events are events which meet the filter. Therefore you will never be able to see the 100 lines in the log prior to the event; that's the point. autoregress works, but it will not be able to return the surrounding 100 events. I'm surprised you did not go for a map solution this time.


Esteemed Legend

The 100 previous events are returned inside 100 fields of the event that is searched, and they can be explicitly carried along with whatever further logic is appended to qualify the thresholds of the alert. The solution works.


Explorer

Thanks, I tested this. I am getting 100 new raw_* columns with empty values. Not sure what's happening.
Let me illustrate the problem. Here's the log file:

09-15-2015 00:01:01 INFO: Blah I did something cool
09-15-2015 00:01:02 INFO: Blah I did something cool again
09-15-2015 00:01:03 WARN: Blah Somebody did, I hear it
09-15-2015 00:01:02 WARN: Blah That was'nt cool after all
09-15-2015 00:01:04 WARN: BLah Somebody did, I smell it
09-15-2015 00:01:05 INFO: Look at moon its shiny today
09-15-2015 00:01:08 ERROR: Ooppps wetted the pants

I set up a Splunk alert on "ERROR". I want to get the entire history of previous lines, along with the alert.
