I have two searches: one produces Icinga problem alerts and the other produces Icinga recovery alerts. I want to compare them by host and State: if the Icinga alert has recovered within 15 minutes, no action should be taken; otherwise a script should be executed.
First search, below is the snippet.
Second query, below is the snippet
I need to compare, per host, the Start_time (Icinga Problem) with the End_time (Icinga Recovery). If the alert has recovered within the SLA (i.e., 15 minutes), nothing needs to happen; otherwise an action should be taken. Any help is appreciated.
Please share some anonymised representative events so we can better understand what you are dealing with. Please use a code block </> so that they can be used to simulate your situation.
index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" "PROBLEM"
| fields host
| transaction host startswith="To:"
| search "To: <Mail-Address>"
| rex field=_raw "Host:(?<src_host_1>.*) - Service:(?<Service_1>.*) State:(?<State_1>.*)"
| rex field=_raw "Subject: (?<Subject>.*)"
| rex field=Subject "PROBLEM - (?<src_host_2>.*) - (?<Service_2>.*) is (?<State_2>.*)"
| rex field=_raw "(?<Additional_Info>.*)\nTo:"
| eval Service= if(isnull(Service_1),Service_2,Service_1) ,src_host= if(isnull(src_host_1),src_host_2,src_host_1) ,State= if(isnull(State_1),State_2,State_1)
| fields host ,Service,src_host,State,Subject,Additional_Info
| lookup hostdata_lookup.csv host as src_host
| table src_host,Service,State,_time, cluster, isvm
| rename _time as Start_time
| search isvm=N AND cluster=*EDGE*
| eval Start_time=strftime(Start_time, "%m/%d/%Y - %H:%M:%S")
| sort Start_time
For security reasons, I have removed the mail address.
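A side note on the search above: Start_time is converted to a string with strftime before the sort, so the sort compares text rather than time (the ordering breaks across a year boundary with a month-first format), and the value can no longer be used for arithmetic such as a 15-minute comparison. A minimal sketch, assuming the rest of the search stays as posted, keeps the epoch value and changes only how it is displayed by using fieldformat:

```spl
...
| rename _time as Start_time
| search isvm=N AND cluster=*EDGE*
| sort 0 Start_time
| fieldformat Start_time=strftime(Start_time, "%m/%d/%Y - %H:%M:%S")
```

Here sort 0 lifts the default 10,000-row limit of sort. Because fieldformat only affects rendering, Start_time stays numeric for any later eval or where.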
So, all the information you need for a "transaction" is in one event? Why are you using the transaction command? What do the other events look like?
Again, it would be useful if you could share them in a code block </> like this
Please check the code; I have shared it as requested. It's the same for the Recovery search as well.
Below is the search query for icinga Problem and events too.
Below is the search query for Icinga Recovery and events.
If you want me to get rid of the transaction command, that's fine. I wanted to group multiple events into a single meta-event that represents a single physical event.
Your recovery event doesn't seem to match the rex pattern you are applying to it. Are there other recovery events which do match? Do you want to ignore the recovery events which don't match the rex pattern?
P.S. You can leave the transaction command in if you like but I don't see what value it is giving you because all the information for the event appears to be in the single event (and therefore the transaction command is just wasting time and resources?).
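As a rough sketch of what dropping transaction could look like: assuming each notification arrives as a single event that already contains the Host/Service/State line (this is not confirmed by the events shared so far), the PROBLEM search might reduce to something like the following. The Subject-based fallback extraction is left out here, which is also an assumption.

```spl
index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" "PROBLEM"
| search "To: <mail-addr>"
| rex field=_raw "Host:(?<src_host>.*) - Service:(?<Service>.*) State:(?<State>.*)"
| lookup hostdata_lookup.csv host as src_host
| search isvm=N AND cluster=*EDGE*
| table _time src_host Service State cluster isvm
```

Keeping _time as an epoch value at this stage leaves it usable for the later 15-minute comparison.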
index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" "PROBLEM"
| fields host
| transaction host startswith="To:"
| search "To: <mail-addr>"
| rex field=_raw "Host:(?<src_host_1>.*) - Service:(?<Service_1>.*) State:(?<State_1>.*)"
| rex field=_raw "Subject: (?<Subject>.*)"
| rex field=Subject "PROBLEM - (?<src_host_2>.*) - (?<Service_2>.*) is (?<State_2>.*)"
| rex field=_raw "(?<Additional_Info>.*)\nTo:"
| eval Service= if(isnull(Service_1),Service_2,Service_1) ,src_host= if(isnull(src_host_1),src_host_2,src_host_1) ,State= if(isnull(State_1),State_2,State_1)
| fields host ,Service,src_host,State,Subject,Additional_Info
| lookup hostdata_lookup.csv host as src_host
| table src_host,Service,State,_time, cluster, isvm
| rename _time as Start_time
| search isvm=N AND cluster=*EDGE*
| eval Start_time=strftime(Start_time, "%m/%d/%Y - %H:%M:%S")
| sort Start_time
index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" "RECOVERY"
| fields host
| transaction host startswith="To:"
| search "To: <mail-addr>"
| rex field=_raw "Host:(?<src_host_1>.*) - Service:(?<Service_1>.*) State:(?<State_1>.*)"
| rex field=_raw "Subject: (?<Subject>.*)"
| rex field=Subject "RECOVERY - (?<src_host_2>.*) - (?<Service_2>.*) is (?<State_2>.*)"
| rex field=_raw "(?<Additional_Info>.*)\nTo:"
| eval Service= if(isnull(Service_1),Service_2,Service_1) ,src_host= if(isnull(src_host_1),src_host_2,src_host_1) ,State= if(isnull(State_1),State_2,State_1)
| fields host ,Service,src_host,State,Subject,Additional_Info
| lookup hostdata_lookup.csv host as src_host
| table src_host,Service,State,_time, cluster, isvm
| rename _time as End_time
| search isvm=N AND cluster=*EDGE*
| eval End_time=strftime(End_time, "%m/%d/%Y - %H:%M:%S")
| sort End_time
No, the recovery search does return events. As I said, one search gives us "Icinga Problem" and another gives us "Icinga Recovery". I want to join them on the Icinga Problem start time and the Icinga Recovery end time; if the recovery takes more than 15 minutes, we need to trigger an alert.
Please share the PROBLEM and RECOVERY events. (It is rather difficult to solve your problem without being able to see what events you are dealing with!)
Please check below snippet.
So, how does your rex command extract src_host_2, Service_2, and State_2 when they don't exist in the events?
index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" ("PROBLEM" OR "RECOVERY")
| fields host
| search "To: <mail-addr>"
| rex field=_raw "Host:(?<src_host_1>.*) - Service:(?<Service_1>.*) State:(?<State_1>.*)"
| rex field=_raw "Subject: (?<Subject>.*)"
| rex field=_raw "(?<Additional_Info>.*)\nTo:"
| eval Service=Service_1, src_host=src_host_1, State=State_1
| eval event_type=if(match(_raw, "Subject: PROBLEM"), "PROBLEM", "RECOVERY")
| lookup hostdata_lookup.csv host as src_host
| table _time src_host Service State event_type cluster isvm
| search cluster=*edge* AND isvm=N
| sort src_host Service _time
| streamstats current=f window=1 last(_time) as previous_time last(event_type) as previous_event_type by src_host Service
| eval previous_time=strftime(previous_time, "%m/%d/%Y - %H:%M:%S")
Below is the output of the above query.
If the CRITICAL alert has not RECOVERED after 15 minutes, we need to alert. Any help is appreciated.
Do you need an alert if there has been a problem which has not been recovered within 15 minutes in your data even if it was recovered after 16 minutes or are you just interested in whether the last problem (without a recovery) was over 15 minutes ago?
Can you get multiple problems (without recovery) events for the same problem i.e. do you need to know when the latest (or any) problem started (and whether it was fixed within 15 minutes)?
Please find my answers after each question.
Do you need an alert if there has been a problem which has not been recovered within 15 minutes in your data even if it was recovered after 16 minutes or are you just interested in whether the last problem (without a recovery) was over 15 minutes ago? Answer: YES
Can you get multiple problems (without recovery) events for the same problem i.e. do you need to know when the latest (or any) problem started (and whether it was fixed within 15 minutes)? Answer: CORRECT
Please clarify your requirements.
Do you need an alert if there has been a problem which has not been recovered within 15 minutes in your data even if it was recovered after 16 minutes or later?
Are you only interested in whether the last problem (without a recovery) was over 15 minutes ago?
Can you get multiple problems (without recovery) events for the same problem?
Does the 15 minutes start when the PROBLEM event for the latest PROBLEM first occurs?
Does the 15 minutes start when the PROBLEM event for the latest PROBLEM last occurs?
How far back are you looking for these events?
How often are you looking for these events?
Please find my answers after each question.
Do you need an alert if there has been a problem which has not been recovered within 15 minutes in your data even if it was recovered after 16 minutes or later? Answer: If the PROBLEM alert has not RECOVERED after 15 minutes, we need to trigger a script.
Are you only interested in whether the last problem (without a recovery) was over 15 minutes ago? Answer: YES
Can you get multiple problems (without recovery) events for the same problem? Answer: Yes. I am running this on edge nodes, which are a limited set of hosts. It could be multiple hosts as well.
Does the 15 minutes start when the PROBLEM event for the latest PROBLEM first occurs? Answer: YES
Does the 15 minutes start when the PROBLEM event for the latest PROBLEM last occurs? Answer: NO
How far back are you looking for these events? Answer: The last 30 minutes.
How often are you looking for these events? Answer: Every 15 minutes.
Can you check the snippet below as well?
Try something like this
...
| sort src_host Service _time
| streamstats current=f window=1 last(event_type) as previous_event_type by src_host Service
| eval problem_start=if(event_type="PROBLEM" AND (isnull(previous_event_type) OR previous_event_type != "PROBLEM"),_time,null())
| streamstats max(problem_start) as problem_start by src_host Service global=f
| eval problem_time=if(event_type="PROBLEM" OR previous_event_type="PROBLEM",_time-problem_start,null())
| where problem_time > 900
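Note that the search above can only flag an event that arrives more than 900 seconds after the problem started. If a PROBLEM is still open and nothing further has been logged for that host and service, there is no later event to compare against. A possible variation, offered as an untested sketch, compares the start of the latest PROBLEM run with now() whenever the most recent event is still a PROBLEM. The field name last_event_type is introduced here for illustration only.

```spl
...
| sort 0 src_host Service _time
| streamstats current=f window=1 last(event_type) as previous_event_type by src_host Service
| eval problem_start=if(event_type="PROBLEM" AND (isnull(previous_event_type) OR previous_event_type != "PROBLEM"),_time,null())
| stats latest(event_type) as last_event_type max(problem_start) as problem_start by src_host Service
| where last_event_type="PROBLEM" AND (now() - problem_start) > 900
```

One caveat: with a 30-minute search window, a problem that began before the window is measured from its first PROBLEM event inside the window, so the window would need to be wide enough to include the true start.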