I have two searches, one search will produce icing...

bmanikya · ‎07-24-2024

I have two searches: one produces Icinga problem alerts and the other produces Icinga recovery alerts. I want to compare the host and State fields. If the Icinga alert has recovered within 15 minutes, no action should be taken; otherwise a script should be executed.

First search, below is the snippet.

Second query, below is the snippet

bmanikya · ‎07-24-2024

I need to compare Host with Start_time (Icinga problem) and End_time (Icinga recovery). If the alert has recovered within the SLA (i.e., 15 minutes), do nothing; otherwise take action. Any help is appreciated.

ITWhisperer · ‎07-24-2024

Please share some anonymised representative events so we can better understand what you are dealing with. Please use a code block </> so that they can be used to simulate your situation.

bmanikya · ‎07-24-2024

index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" "PROBLEM"
| fields host 
| transaction host startswith="To:" 
| search "To: <Mail-Address>" 
| rex field=_raw "Host:(?<src_host_1>.*) - Service:(?<Service_1>.*) State:(?<State_1>.*)" 
| rex field=_raw "Subject: (?<Subject>.*)" 
| rex field=Subject "PROBLEM - (?<src_host_2>.*) - (?<Service_2>.*) is (?<State_2>.*)" 
| rex field=_raw "(?<Additional_Info>.*)\nTo:" 
| eval Service= if(isnull(Service_1),Service_2,Service_1) ,src_host= if(isnull(src_host_1),src_host_2,src_host_1) ,State= if(isnull(State_1),State_2,State_1) 
| fields host ,Service,src_host,State,Subject,Additional_Info
| lookup hostdata_lookup.csv host as src_host
| table src_host,Service,State,_time, cluster, isvm
| rename _time as Start_time
| search isvm=N AND cluster=*EDGE*
| eval Start_time=strftime(Start_time, "%m/%d/%Y - %H:%M:%S")
| sort Start_time

For security reasons, I have removed the mail address.

ITWhisperer · ‎07-24-2024

So, all the information you need for a "transaction" is in one event? Why are you using the transaction command? What do the other events look like?

Again, it would be useful if you could share them in a code block </> like this

bmanikya · ‎07-24-2024

Please check the code; I have shared it as requested. It's the same for the Recovery search as well.

ITWhisperer · ‎07-24-2024

So, all the information you need for a "transaction" is in one event? Why are you using the transaction command? What do the other events look like?

bmanikya · ‎07-24-2024

Below is the search query for icinga Problem and events too.

Below is the search query for Icinga Recovery and events.

If you want me to get rid of the transaction command, that's fine. I would like to group multiple events into a single meta-event that represents a single physical event.

ITWhisperer · ‎07-24-2024

Your recovery event doesn't seem to match the rex pattern you are applying to it. Are there other recovery events which do match? Do you want to ignore the recovery events which don't match the rex pattern?

P.S. You can leave the transaction command in if you like but I don't see what value it is giving you because all the information for the event appears to be in the single event (and therefore the transaction command is just wasting time and resources?).

bmanikya · ‎07-24-2024

index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" "PROBLEM"
| fields host 
| transaction host startswith="To:" 
| search "To: <mail-addr>" 
| rex field=_raw "Host:(?<src_host_1>.*) - Service:(?<Service_1>.*) State:(?<State_1>.*)" 
| rex field=_raw "Subject: (?<Subject>.*)" 
| rex field=Subject "PROBLEM - (?<src_host_2>.*) - (?<Service_2>.*) is (?<State_2>.*)" 
| rex field=_raw "(?<Additional_Info>.*)\nTo:" 
| eval Service= if(isnull(Service_1),Service_2,Service_1) ,src_host= if(isnull(src_host_1),src_host_2,src_host_1) ,State= if(isnull(State_1),State_2,State_1) 
| fields host ,Service,src_host,State,Subject,Additional_Info
| lookup hostdata_lookup.csv host as src_host
| table src_host,Service,State,_time, cluster, isvm
| rename _time as Start_time
| search isvm=N AND cluster=*EDGE*
| eval Start_time=strftime(Start_time, "%m/%d/%Y - %H:%M:%S")
| sort Start_time

index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" "RECOVERY"
| fields host 
| transaction host startswith="To:" 
| search "To: <mail-addr>" 
| rex field=_raw "Host:(?<src_host_1>.*) - Service:(?<Service_1>.*) State:(?<State_1>.*)" 
| rex field=_raw "Subject: (?<Subject>.*)" 
| rex field=Subject "RECOVERY - (?<src_host_2>.*) - (?<Service_2>.*) is (?<State_2>.*)" 
| rex field=_raw "(?<Additional_Info>.*)\nTo:" 
| eval Service= if(isnull(Service_1),Service_2,Service_1) ,src_host= if(isnull(src_host_1),src_host_2,src_host_1) ,State= if(isnull(State_1),State_2,State_1) 
| fields host ,Service,src_host,State,Subject,Additional_Info
| lookup hostdata_lookup.csv host as src_host
| table src_host,Service,State,_time, cluster, isvm
| rename _time as End_time
| search isvm=N AND cluster=*EDGE*
| eval End_time=strftime(End_time, "%m/%d/%Y - %H:%M:%S")
| sort End_time

No, recovery has events. As I said, one search gives us "Icinga Problem" and another search gives us "Icinga Recovery". Using a join on the Icinga Problem start time and the Icinga Recovery end time, if the recovery takes more than 15 minutes, we need to trigger an alert.

ITWhisperer · ‎07-25-2024

Please share the PROBLEM and RECOVERY events. (It is rather difficult to solve your problem without being able to see what events you are dealing with!)

bmanikya · ‎07-25-2024

Please check the snippet below.

ITWhisperer · ‎07-25-2024

So, how does your rex command extract src_host_2, Service_2, and State_2 when they don't exist in the events?
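
One way to see which of the two patterns actually fires is to count events by whether each extraction produced a host. The sketch below reuses the index, sourcetype and rex patterns from the RECOVERY search posted above and leaves out the transaction step, on the assumption (raised earlier in this thread) that each notification arrives as a single event. The field names body_match and subject_match are made up for illustration.

```spl
index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" "RECOVERY"
| rex field=_raw "Host:(?<src_host_1>.*) - Service:(?<Service_1>.*) State:(?<State_1>.*)"
| rex field=_raw "Subject: (?<Subject>.*)"
| rex field=Subject "RECOVERY - (?<src_host_2>.*) - (?<Service_2>.*) is (?<State_2>.*)"
| eval body_match=if(isnotnull(src_host_1), "yes", "no"), subject_match=if(isnotnull(src_host_2), "yes", "no")
| stats count by body_match subject_match
```

Rows with "no" in both columns would be events that neither pattern handles, so src_host, Service and State would come out empty for them after the eval step.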

bmanikya · ‎07-29-2024

index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" "PROBLEM" OR "RECOVERY"
| fields host
| search "To: <mail-addr>" 
| rex field=_raw "Host:(?<src_host_1>.*) - Service:(?<Service_1>.*) State:(?<State_1>.*)" 
| rex field=_raw "Subject: (?<Subject>.*)" 
| rex field=_raw "(?<Additional_Info>.*)\nTo:"
| eval Service= if(isnull(Service_1),Service_2,Service_1) ,src_host= if(isnull(src_host_1),src_host_2,src_host_1) ,State= if(isnull(State_1),State_2,State_1) 
| eval event_type=if(match(_raw, "Subject: PROBLEM"), "PROBLEM", "RECOVERY")
| lookup hostdata_lookup.csv host as src_host 
| table _time src_host Service State event_type cluster isvm
| search cluster=*edge* AND isvm=N
| sort src_host Service _time
| streamstats current=f window=1 last(_time) as previous_time last(event_type) as previous_event_type by src_host Service
| eval previous_time=strftime(previous_time, "%m/%d/%Y - %H:%M:%S")

Below is the output of the above query:

If the CRITICAL alert has not RECOVERED within 15 minutes, we need to alert. Any help is appreciated.

ITWhisperer · ‎07-29-2024

Do you need an alert if there has been a problem which has not been recovered within 15 minutes in your data even if it was recovered after 16 minutes or are you just interested in whether the last problem (without a recovery) was over 15 minutes ago?

Can you get multiple problems (without recovery) events for the same problem i.e. do you need to know when the latest (or any) problem started (and whether it was fixed within 15 minutes)?

bmanikya · ‎07-29-2024

Please find my answers below each question.

Do you need an alert if there has been a problem which has not been recovered within 15 minutes in your data even if it was recovered after 16 minutes or are you just interested in whether the last problem (without a recovery) was over 15 minutes ago?
Answer: YES

Can you get multiple problems (without recovery) events for the same problem i.e. do you need to know when the latest (or any) problem started (and whether it was fixed within 15 minutes)?
Answer: CORRECT

ITWhisperer · ‎07-29-2024

Please clarify your requirements.

Do you need an alert if there has been a problem which has not been recovered within 15 minutes in your data even if it was recovered after 16 minutes or later?

Are you only interested in whether the last problem (without a recovery) was over 15 minutes ago?

Can you get multiple problems (without recovery) events for the same problem?

Does the 15 minutes start when the PROBLEM event for the latest PROBLEM first occurs?

Does the 15 minutes start when the PROBLEM event for the latest PROBLEM last occurs?

How far back are you looking for these events?

How often are you looking for these events?

bmanikya · ‎07-29-2024

Please find my answers below each question.

Do you need an alert if there has been a problem which has not been recovered within 15 minutes in your data even if it was recovered after 16 minutes or later?
Answer: If the PROBLEM alert has not RECOVERED within 15 minutes, we need to trigger a script.

Are you only interested in whether the last problem (without a recovery) was over 15 minutes ago?
Answer: YES

Can you get multiple problems (without recovery) events for the same problem?
Answer: Yes. I am running this on edge nodes, which are a limited set of hosts. It could be multiple hosts as well.

Does the 15 minutes start when the PROBLEM event for the latest PROBLEM first occurs?
Answer: YES

Does the 15 minutes start when the PROBLEM event for the latest PROBLEM last occurs?
Answer: NO

How far back are you looking for these events?
Answer: The last 30 minutes.

How often are you looking for these events?
Answer: Every 15 minutes.

Can you check the snippet below as well?

ITWhisperer · ‎07-29-2024

Try something like this

...
| sort src_host Service _time
| streamstats current=f window=1 last(event_type) as previous_event_type by src_host Service
| eval problem_start=if(event_type="PROBLEM" AND (isnull(previous_event_type) OR previous_event_type != "PROBLEM"),_time,null())
| streamstats max(problem_start) as problem_start by src_host Service global=f
| eval problem_time=if(event_type="PROBLEM" OR previous_event_type="PROBLEM",_time-problem_start,null())
| where problem_time > 900
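
To try the logic above without touching real data, it can be run against a few simulated events. This is only a sketch: the host name edge01 and the four event offsets (0, 300, 1200 and 1500 seconds after a start point one hour ago) are invented for illustration, and the counter field n exists only to build the sample rows. Everything from the sort onwards is copied from the search above.

```spl
| makeresults count=4
| streamstats count as n
| eval src_host="edge01", Service="Load_per_CPU_core"
| eval event_type=case(n=1, "PROBLEM", n=2, "PROBLEM", n=3, "RECOVERY", n=4, "PROBLEM")
| eval _time=relative_time(now(), "-60m") + case(n=1, 0, n=2, 300, n=3, 1200, n=4, 1500)
| sort src_host Service _time
| streamstats current=f window=1 last(event_type) as previous_event_type by src_host Service
| eval problem_start=if(event_type="PROBLEM" AND (isnull(previous_event_type) OR previous_event_type != "PROBLEM"),_time,null())
| streamstats max(problem_start) as problem_start by src_host Service global=f
| eval problem_time=if(event_type="PROBLEM" OR previous_event_type="PROBLEM",_time-problem_start,null())
| where problem_time > 900
| table _time src_host Service event_type problem_start problem_time
```

With this sample, the only row that should survive the where clause is the RECOVERY event, with a problem_time of 1200 seconds measured from the first PROBLEM. The final PROBLEM event starts a new problem and gets a problem_time of 0 because no later event follows it, so a problem that is still open is not flagged by this search on its own. If that case matters, one possible extension is to compare now() - problem_start against 900 for the most recent event per src_host and Service. Bear in mind that with a 30 minute lookback the first PROBLEM event of a long-running problem may fall outside the search window.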
