I want to set up a Splunk Alert when an ERROR string is seen. As part of the Alert, I want to send 100 lines in the log prior to the event. Is that doable?
I can't tell if this is brilliant or if it's the worst level of horribad, but I found something that may work. I apologize in advance if this makes your server explode or the universe collapse in on itself due to its inefficiency. Still, I think it works, so give it a try!
I don't have your search or your data, so I had to mock up an answer on my own data, which is very much not like yours. Anyway, here's what I came up with.
Please ignore my root search - it's my firewall logs and I rex up a field I found that appears in about 1.5% of the events just so I had something to work with. That's the first line. I include it so you can see where my answer is coming from.
host=<myhost> | rex "^[^>\n]*>(?P<bad_cksum>[^\)]+)"
| streamstats window=100 last(bad_cksum) as trigger_error
| transaction trigger_error
The streamstats is where the magic happens. Streamstats keeps a 100-event window as the events stream through it. When it finds a "bad_cksum" defined in the last (most recent) event of any particular group of 100, it copies that bad_cksum value to the other 99 events as the field "trigger_error".
If you want to take a look at the data before the transaction to see what the streamstats does, drop off the transaction command and put in a table blah blah bad_cksum trigger_error
and scroll through the results for a while until you find a section where it "did its thing."
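In other words (using my mock data's field names, so substitute your own), the diagnostic version would look something like this:

host=<myhost> | rex "^[^>\n]*>(?P<bad_cksum>[^\)]+)"
| streamstats window=100 last(bad_cksum) as trigger_error
| table _time _raw bad_cksum trigger_error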
Once you have this new field which is mostly non-existent except for when the error occurs and for the 99 preceding events, we just build a transaction out of that.
I have tested this and it appears to work. Because of how I'm using it, bad_cksum must be a field that's defined only on the events where the error actually occurs. In your case you may want an eval that creates a concatenated field of several key/unique items from the event whenever it sees the word ERROR. I'll handwave that away for the moment, hoping you can figure something out. If not, we'll definitely need some sample data and sample searches to work from.
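As a rough, untested sketch of that eval (the field name trigger_key is made up, and searchmatch() is just one way to spot the word ERROR), it might look like:

<your base search>
| eval trigger_key=if(searchmatch("ERROR"), _time."-".host."-ERROR", null())
| streamstats window=100 last(trigger_key) as trigger_error
| transaction trigger_error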
Super-helpful-hint: you may want to test it on smallish data sets first - find a few hour time-frame, a few minutes or whatever where you only have a few errors to test with. Scale up from there.
And lastly, I based this sort of off an answer I came up with to another problem a while back. In that Answers post is another take on streamstats that may be useful to help understand what it's doing.
Unfortunately this doesn't work for me.
Here's the example log data
09-15-2015 10:00:00 INFO blah1
09-15-2015 10:00:01 INFO blah2
09-15-2015 10:00:02 INFO blah3
09-15-2015 10:00:03 WARN blah4
09-15-2015 10:00:04 INFO blah5
09-15-2015 10:00:04 ERROR blah6
index=myapp host="myhost" ERROR | streamstats window=10 last(ERROR) as trigger_error| transaction trigger_error
Doesn't yield anything 😞
That's exactly the point I was making yesterday: if your base search is filtering for ERROR, you have already lost the _raw for INFO and WARN, and streamstats or autoregress will not be able to find anything besides events containing ERROR. Use this as your base search and you should be able to find something:
index=myapp host="myhost" ERROR OR INFO OR WARN | streamstats window=10 last(ERROR) as trigger_error| transaction trigger_error
still un-tested so don't blame me if it's not working right out of the box 🙂
Yes, I forgot to call out the fact that your root search has to be fairly broad. If you limit it to only your ERROR lines, well, then, they're gone and it's very hard to include them afterwards.
MuS is on the right track in using the streamstats to provide your new field, but in your specific case this might have to be taken one step further and we may need to create a unique field to operate on so transaction doesn't get goofy upon overlap or non-beautiful data. Most data isn't beautiful. It's a shame, really. Anyway.
We can use eval to do this. I think this is how you'd do it. I probably messed up the rex, and I'm sure it can be improved on.
index=myapp host=myhost ERROR OR INFO OR WARN
| rex "^\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2} (?<level>\w+) (?<message>.*)"
| eval my_working_field=if(level=="ERROR",_time."-ERROR",null())
| streamstats window=100 last(my_working_field) as trigger_error
| transaction trigger_error
The rex creates new fields out of your "level" and "message" (level being ERROR, WARN, INFO in your example and message being everything else in the line). I do this because it seems more efficient when I have to eval against that in the next line. If you already have this as a field, great - use that. Otherwise, this will do.
The eval asks, "is level equal to ERROR?" If so, it creates a new field "my_working_field" as a concatenation of the _time field and the string "-ERROR" (like "1442777888.591122-ERROR"). This will be unique. If the level isn't ERROR, then it doesn't create the field AT ALL (that's the "null()" part). This is important.
The last two commands, streamstats and transaction, are largely unchanged.
This should work, though obviously it may need minor tweaking. And I'm not convinced it'll behave perfectly if you have ERRORs happening within 100 events of each other. It might, and it might not. I just don't know, but I'd love to hear back!
Thanks.. I missed this.. Will try today and let you know. Thanks a lot for coming up with brilliant solution in this area.
Hi @cybrainx
Was wondering if you'd be able to update whether or not @rich7177's solution worked for you. If yes, please accept it! If no, please update with more details. Thanks!
This is pretty great. Saving this off. Thanks
Nice one - I'll keep it on my favorites list!
How many error entries will there be in the alert? I can suggest a workaround, but I need to know the count and your current alert search to see if it's even feasible with that workaround.
You could use autoregress
like this (tested; it works):
<YOUR BASE SEARCH HERE> | eval raw=_raw | autoregress raw p=1-100 | table raw* | <YOUR SEARCH CONDITIONS HERE>
http://docs.splunk.com/Documentation/Splunk/6.2.6/SearchReference/Autoregress
Thanks for your response; autoregress did not work. This seems to be a basic requirement. I am sure there is a way.
I tried "tail 100", but that shows the last 100 times when ERROR occurred. I want all the lines prior, which hold logs that can give a clue to the cause of the error. This would help us react to issues faster.
As I said, the problem is that after any filter is applied to the base search, the _raw events previous to this event are gone. Wait for @somesoni2 and his solution.
I am not saying it is the best solution (certainly the one by @rich7177 is better) but this will work. Copying his example, it would work like this:
host=<myhost> | rex "^[^>\n]*>(?P<bad_cksum>[^\)]+)" | eval raw=_raw | autoregress raw p=1-100 | where isnotnull(bad_cksum) | table raw*
This approach will cause considerable RAM use (every event searched will expand 100-fold), but it does work.
I do get your point, though; we cannot do the autoregress after the discrimination, because then we would not have the adjacent events, which means we horribly bloat the RAM to do the search (kind of like transaction does).
This will not and cannot work; as soon as you hit the condition of the alert, _raw is only available for the alert condition event.
Did you test? I admit that I did not so my answer is a little off but it does indeed work (I just tested it); do it like this:
<YOUR BASE SEARCH HERE> | eval raw=_raw | autoregress raw p=1-100 | table raw* | <YOUR SEARCH CONDITIONS HERE>
In this case, a single event is returned but it has 100 fields (raw_p1, raw_p2, ..., raw_p100).
Original answer updated.
Yes, I did test it. This may work this way, but as soon as you have a filter on the base search (this is what an alert will be) it does not work.
Try this:
index=_internal kbps>=10 | eval raw=_raw | autoregress raw p=1-100 | table raw*
All previous events are events which hit the alert condition as well.
Exactly, it works just the way I would expect. In what way does it not work for you? Obviously the fields passed to the table command will need to be adjusted so that the necessary filter conditions can be put inside the search itself instead of in the alert configuration, but that is a trivial hurdle.
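For instance (an untested sketch on the same _internal data), keep the base search broad, run the autoregress, and only then apply the alert condition:

index=_internal | eval raw=_raw | autoregress raw p=1-100 | where kbps>=10 | table raw*

That way raw_p1 through raw_p100 are filled from the unfiltered stream before the condition throws anything away.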
The question was "I want to send 100 lines in the log prior to the event", which is no longer possible after the filter on the base search because all the _raw events are events which meet the filter. Therefore you will never be able to see the 100 lines in the log prior to the event; that's the point. autoregress works, but it will not be able to return the surrounding 100 events. I'm surprised you did not go for a map solution this time.
The 100 previous events are returned inside 100 fields of the event that is searched, and they can be explicitly carried along with whatever further logic is appended to qualify the thresholds of the alert. The solution works.
Thanks, I tested this. I am getting 100 new raw_ columns with empty values. Not sure what's happening.
Let me illustrate the problem. Here's the log file:
09-15-2015 00:01:01 INFO: Blah I did something cool
09-15-2015 00:01:02 INFO: Blah I did something cool again
09-15-2015 00:01:03 WARN: Blah Somebody did, I hear it
09-15-2015 00:01:02 WARN: Blah That was'nt cool after all
09-15-2015 00:01:04 WARN: BLah Somebody did, I smell it
09-15-2015 00:01:05 INFO: Look at moon its shiny today
09-15-2015 00:01:08 ERROR: Ooppps wetted the pants
I set up a Splunk alert on "ERROR". I want to get the entire history of previous lines along with the alert.