Alerting

Alert if last log message is X in last 30 minutes

greekleo89
Loves-to-Learn Everything

Hi All,

We are monitoring the same log file across multiple hosts, and we have observed that when a particular error gets logged, the service on that machine stops. When this happens, nothing else is logged in the log file but the error. The machine will automatically try to bring the service back up, and if it succeeds, normal log entries follow.

Aim:
Our aim is to capture this particular error, but only alert if that error is the last entry in this log file within the last 30 minutes or so.

Any help on this would be greatly appreciated.

For argument's sake, the error looks like this:
***ERROR*** Exception occurred in serviceB_TDR


ITWhisperer
SplunkTrust

Assuming events are returned newest first, one way to do this would be:

| head 1
| search "***ERROR*** Exception occurred in serviceB_TDR"
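
In full, with a placeholder base search (the index and sourcetype names here are assumptions — substitute your own):

index=your_index sourcetype=your_sourcetype earliest=-30m
| head 1
| search "***ERROR*** Exception occurred in serviceB_TDR"

head 1 keeps only the first event in the pipeline, which for a default time-sorted search is the most recent one.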

 


greekleo89
Loves-to-Learn Everything

By that, do you mean events in Splunk's indexer?

As per the norm, the newest event/entry is always at the bottom of the log file.

 

Thanks


ITWhisperer
SplunkTrust

I mean when you do a search in Splunk, which event comes back first? head 1 will just keep the first event in the pipeline.


greekleo89
Loves-to-Learn Everything

Cool, I get you now. I need to make sure, however, that an event from another log file isn't "counted", so I suppose I would just do a stats count by host so that the results are unique per host, right?


All I am saying is that since we are getting these logs, which could be duplicates, from multiple hosts at the same time: if I do head 1 for, say, the last 30 minutes, what happens if one machine is down while the others are logging as normal? The error I am looking for will not be the first in the pipeline, as normal messages from the other hosts would come first.


ITWhisperer
SplunkTrust
| stats latest(_raw) as _raw latest(_time) as _time by host
| search "***ERROR*** Exception occurred in serviceB_TDR"

greekleo89
Loves-to-Learn Everything

Thank you very much for the reply.

 

These logs come in every minute, so with the above search, what would happen in the following scenario:

The search runs every 30 minutes (say at 00:00) and looks at the last 30 minutes (23:30-00:00). For argument's sake, the error appears at 23:59 and is the last line written. This will fire the alert; however, the logic I want to apply is to alert only if it is the last message and has been the last message for 30 minutes.


ITWhisperer
SplunkTrust

Yes, your alert needs to check the value of _time (which is why it is included in the stats) to see how long ago the event was.

If your alert is only running every half hour, the timeframe for your search should include the past hour.
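
Putting the thread together, one sketch of the complete alert search (index and sourcetype names are placeholders, and the hour-long window follows the note above) could be:

index=your_index sourcetype=your_sourcetype earliest=-60m
| stats latest(_raw) as _raw latest(_time) as _time by host
| search "***ERROR*** Exception occurred in serviceB_TDR"
| where _time <= relative_time(now(), "-30m")

This keeps only the hosts whose most recent event is the error, then alerts only when that event is at least 30 minutes old, i.e. nothing normal has been logged since.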


greekleo89
Loves-to-Learn Everything

Great thank you.


greekleo89
Loves-to-Learn Everything

The full log line is: 04/11/2022 17:47:58.846593 [Machine1] ***ERROR*** Exception occurred in serviceB_TDR
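
If the bracketed machine name ever needs to be pulled out of the raw event (for example, to compare it against the Splunk host field), it could be extracted with a rex like this (the field name machine is an assumption):

| rex "\[(?<machine>[^\]]+)\]\s+\*\*\*ERROR\*\*\*"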
