I have two searches: one produces Icinga problem alerts and the other produces Icinga recovery alerts.

bmanikya
Loves-to-Learn Everything

I have two searches: one produces Icinga problem alerts and the other produces Icinga recovery alerts. I want to compare the host and State fields across the two: if an Icinga alert has recovered within 15 minutes, no action should be taken; otherwise a script should be executed.

The first search is shown in the snippet below.

bmanikya_0-1721806017759.png

The second search is shown in the snippet below.

bmanikya_1-1721806075088.png


bmanikya
Loves-to-Learn Everything

I need to compare, per host, Start_time (Icinga problem) with End_time (Icinga recovery). If the alert has recovered within the SLA (i.e. 15 minutes), nothing needs to happen; otherwise an action should be taken. Any help is appreciated.

ITWhisperer
SplunkTrust

Please share some anonymised representative events so we can better understand what you are dealing with. Please use a code block </> so that they can be used to simulate your situation.

bmanikya
Loves-to-Learn Everything

 

 

index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" "PROBLEM"
| fields host 
| transaction host startswith="To:" 
| search "To: <Mail-Address>" 
| rex field=_raw "Host:(?<src_host_1>.*) - Service:(?<Service_1>.*) State:(?<State_1>.*)" 
| rex field=_raw "Subject: (?<Subject>.*)" 
| rex field=Subject "PROBLEM - (?<src_host_2>.*) - (?<Service_2>.*) is (?<State_2>.*)" 
| rex field=_raw "(?<Additional_Info>.*)\nTo:" 
| eval Service= if(isnull(Service_1),Service_2,Service_1) ,src_host= if(isnull(src_host_1),src_host_2,src_host_1) ,State= if(isnull(State_1),State_2,State_1) 
| fields host ,Service,src_host,State,Subject,Additional_Info
| lookup hostdata_lookup.csv host as src_host
| table src_host,Service,State,_time, cluster, isvm
| rename _time as Start_time
| search isvm=N AND cluster=*EDGE*
| eval Start_time=strftime(Start_time, "%m/%d/%Y - %H:%M:%S")
| sort Start_time

 

For security reasons, I have removed the mail address.

bmanikya_0-1721810752116.png

 

ITWhisperer
SplunkTrust

So, all the information you need for a "transaction" is in one event? Why are you using the transaction command? What do the other events look like? 

Again, it would be useful if you could share them in a code block </> like this

bmanikya
Loves-to-Learn Everything

Please check the code; I have shared it as requested. It's the same for the recovery search as well.

ITWhisperer
SplunkTrust

So, all the information you need for a "transaction" is in one event? Why are you using the transaction command? What do the other events look like? 

bmanikya
Loves-to-Learn Everything

Below are the search query for the Icinga PROBLEM alerts and the events it returns.

bmanikya_0-1721819520919.png

 

Below are the search query for the Icinga RECOVERY alerts and the events it returns.

 

bmanikya_1-1721819609023.png

 

If you want me to get rid of the transaction command, that's fine. I would like to group multiple events into a single meta-event that represents a single physical event.

ITWhisperer
SplunkTrust

Your recovery event doesn't seem to match the rex pattern you are applying to it. Are there other recovery events which do match? Do you want to ignore the recovery events which don't match the rex pattern?

P.S. You can leave the transaction command in if you like but I don't see what value it is giving you because all the information for the event appears to be in the single event (and therefore the transaction command is just wasting time and resources?).
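
For illustration, here is a rough sketch of the PROBLEM search with the transaction command dropped. It assumes each notification arrives as one event that already contains the To:, Subject: and Host:/Service:/State: lines (the events are only visible in screenshots, so that is an assumption), and it reuses the index, lookup and rex patterns from the search posted above. It also sorts on the epoch _time before formatting it, because sorting the formatted string is a text sort rather than a time sort.

```
index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" "PROBLEM" "To: <Mail-Address>"
| rex field=_raw "Host:(?<src_host_1>.*) - Service:(?<Service_1>.*) State:(?<State_1>.*)"
| rex field=_raw "Subject: PROBLEM - (?<src_host_2>.*) - (?<Service_2>.*) is (?<State_2>.*)"
| eval src_host=coalesce(src_host_1, src_host_2), Service=coalesce(Service_1, Service_2), State=coalesce(State_1, State_2)
| lookup hostdata_lookup.csv host as src_host
| search isvm=N AND cluster=*EDGE*
| sort 0 _time
| eval Start_time=strftime(_time, "%m/%d/%Y - %H:%M:%S")
| table src_host Service State Start_time cluster isvm
```

If a single notification really is split across several events, the transaction (or an equivalent stats grouping) would still be needed, so check the raw events first.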

bmanikya
Loves-to-Learn Everything
index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" "PROBLEM"
| fields host 
| transaction host startswith="To:" 
| search "To: <mail-addr>" 
| rex field=_raw "Host:(?<src_host_1>.*) - Service:(?<Service_1>.*) State:(?<State_1>.*)" 
| rex field=_raw "Subject: (?<Subject>.*)" 
| rex field=Subject "PROBLEM - (?<src_host_2>.*) - (?<Service_2>.*) is (?<State_2>.*)" 
| rex field=_raw "(?<Additional_Info>.*)\nTo:" 
| eval Service= if(isnull(Service_1),Service_2,Service_1) ,src_host= if(isnull(src_host_1),src_host_2,src_host_1) ,State= if(isnull(State_1),State_2,State_1) 
| fields host ,Service,src_host,State,Subject,Additional_Info
| lookup hostdata_lookup.csv host as src_host
| table src_host,Service,State,_time, cluster, isvm
| rename _time as Start_time
| search isvm=N AND cluster=*EDGE*
| eval Start_time=strftime(Start_time, "%m/%d/%Y - %H:%M:%S")
| sort Start_time

 

index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" "RECOVERY"
| fields host 
| transaction host startswith="To:" 
| search "To: <mail-addr>" 
| rex field=_raw "Host:(?<src_host_1>.*) - Service:(?<Service_1>.*) State:(?<State_1>.*)" 
| rex field=_raw "Subject: (?<Subject>.*)" 
| rex field=Subject "RECOVERY - (?<src_host_2>.*) - (?<Service_2>.*) is (?<State_2>.*)" 
| rex field=_raw "(?<Additional_Info>.*)\nTo:" 
| eval Service= if(isnull(Service_1),Service_2,Service_1) ,src_host= if(isnull(src_host_1),src_host_2,src_host_1) ,State= if(isnull(State_1),State_2,State_1) 
| fields host ,Service,src_host,State,Subject,Additional_Info
| lookup hostdata_lookup.csv host as src_host
| table src_host,Service,State,_time, cluster, isvm
| rename _time as End_time
| search isvm=N AND cluster=*EDGE*
| eval End_time=strftime(End_time, "%m/%d/%Y - %H:%M:%S")
| sort End_time

 

No, the recovery search does return events. As I said, one search gives us "Icinga Problem" and another search gives us "Icinga Recovery". I want to join them and compare the Icinga Problem start time with the Icinga Recovery end time; if the recovery takes more than 15 minutes, an alert needs to be triggered.

ITWhisperer
SplunkTrust

Please share the PROBLEM and RECOVERY events. (It is rather difficult to solve your problem without being able to see what events you are dealing with!)

bmanikya
Loves-to-Learn Everything

Please check the snippet below.

 

bmanikya_0-1721912096561.png

 

ITWhisperer
SplunkTrust

So, how does your rex command extract src_host_2, Service_2, and State_2 when they don't exist in the events?
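
One way to check is to count how many events each pattern actually matches. This is only a rough sketch that reuses the index and rex patterns posted above; the body_match and subject_match field names are made up for illustration, and the combined PROBLEM|RECOVERY pattern is an assumption about the Subject line.

```
index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" ("PROBLEM" OR "RECOVERY")
| rex field=_raw "Host:(?<src_host_1>.*) - Service:(?<Service_1>.*) State:(?<State_1>.*)"
| rex field=_raw "Subject: (?<Subject>.*)"
| rex field=Subject "(?<event_type>PROBLEM|RECOVERY) - (?<src_host_2>.*) - (?<Service_2>.*) is (?<State_2>.*)"
| eval body_match=if(isnotnull(src_host_1), "yes", "no"), subject_match=if(isnotnull(src_host_2), "yes", "no")
| fillnull value="unknown" event_type
| stats count by event_type body_match subject_match
```

Rows where both body_match and subject_match are "no" are events that neither pattern parses, so src_host, Service and State would end up null for them.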

bmanikya
Loves-to-Learn Everything
index=imdc_nagios_hadoop sourcetype=icinga host=* "Load_per_CPU_core" "PROBLEM" OR "RECOVERY"
| fields host
| search "To: <mail-addr>" 
| rex field=_raw "Host:(?<src_host_1>.*) - Service:(?<Service_1>.*) State:(?<State_1>.*)" 
| rex field=_raw "Subject: (?<Subject>.*)" 
| rex field=_raw "(?<Additional_Info>.*)\nTo:"
| eval Service= if(isnull(Service_1),Service_2,Service_1) ,src_host= if(isnull(src_host_1),src_host_2,src_host_1) ,State= if(isnull(State_1),State_2,State_1) 
| eval event_type=if(match(_raw, "Subject: PROBLEM"), "PROBLEM", "RECOVERY")
| lookup hostdata_lookup.csv host as src_host 
| table _time src_host Service State event_type cluster isvm
| search cluster=*edge* AND isvm=N
| sort src_host Service _time
| streamstats current=f window=1 last(_time) as previous_time last(event_type) as previous_event_type by src_host Service
| eval previous_time=strftime(previous_time, "%m/%d/%Y - %H:%M:%S")

 

Below is the output of the above query:

 

bmanikya_0-1722237405912.png

 

If the CRITICAL alert has not RECOVERED after 15 minutes, we need to alert. Any help is appreciated.

 

ITWhisperer
SplunkTrust

Do you need an alert if there has been a problem which has not been recovered within 15 minutes in your data even if it was recovered after 16 minutes or are you just interested in whether the last problem (without a recovery) was over 15 minutes ago?

Can you get multiple problems (without recovery) events for the same problem i.e. do you need to know when the latest (or any) problem started (and whether it was fixed within 15 minutes)?

bmanikya
Loves-to-Learn Everything

Please find my answers after each question.

Do you need an alert if there has been a problem which has not been recovered within 15 minutes in your data even if it was recovered after 16 minutes or are you just interested in whether the last problem (without a recovery) was over 15 minutes ago?
Answer: YES

Can you get multiple problems (without recovery) events for the same problem i.e. do you need to know when the latest (or any) problem started (and whether it was fixed within 15 minutes)?
Answer: CORRECT

ITWhisperer
SplunkTrust

Please clarify your requirements.

Do you need an alert if there has been a problem which has not been recovered within 15 minutes in your data even if it was recovered after 16 minutes or later?

Are you only interested in whether the last problem (without a recovery) was over 15 minutes ago?

Can you get multiple problems (without recovery) events for the same problem?

Does the 15 minutes start when the PROBLEM event for the latest PROBLEM first occurs?

Does the 15 minutes start when the PROBLEM event for the latest PROBLEM last occurs?

How far back are you looking for these events?

How often are you looking for these events?

bmanikya
Loves-to-Learn Everything

Please find my answers after each question.

Do you need an alert if there has been a problem which has not been recovered within 15 minutes in your data even if it was recovered after 16 minutes or later?
Answer: If the PROBLEM alert has not RECOVERED after 15 minutes, we need to trigger a script.

Are you only interested in whether the last problem (without a recovery) was over 15 minutes ago?
Answer: YES

Can you get multiple problems (without recovery) events for the same problem?
Answer: Yes. I am running this on edge nodes, which are a limited set of hosts. It could be multiple hosts as well.

Does the 15 minutes start when the PROBLEM event for the latest PROBLEM first occurs?
Answer: YES

Does the 15 minutes start when the PROBLEM event for the latest PROBLEM last occurs?
Answer: NO

How far back are you looking for these events?
Answer: The last 30 minutes.

How often are you looking for these events?
Answer: Every 15 minutes.

 

 

Can you check the snippet below as well?

 

bmanikya_0-1722248512988.png

 

ITWhisperer
SplunkTrust

Try something like this

...
| sort src_host Service _time
| streamstats current=f window=1 last(event_type) as previous_event_type by src_host Service
| eval problem_start=if(event_type="PROBLEM" AND (isnull(previous_event_type) OR previous_event_type != "PROBLEM"),_time,null())
| streamstats max(problem_start) as problem_start by src_host Service global=f
| eval problem_time=if(event_type="PROBLEM" OR previous_event_type="PROBLEM",_time-problem_start,null())
| where problem_time > 900
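
Note that problem_time is measured between events, so a problem that is still open with no later event in the search window has nothing to be measured against. If still-open problems are what should trigger the script, a variation along these lines might work. It is only a sketch: it assumes the search runs over the last 30 minutes as described earlier in the thread, it keeps the same elided base search, and last_event_type is a name introduced here for illustration.

```
...
| sort 0 src_host Service _time
| streamstats current=f window=1 last(event_type) as previous_event_type by src_host Service
| eval problem_start=if(event_type="PROBLEM" AND (isnull(previous_event_type) OR previous_event_type != "PROBLEM"),_time,null())
| stats max(problem_start) as problem_start latest(event_type) as last_event_type by src_host Service
| where last_event_type="PROBLEM" AND now() - problem_start > 900
| eval problem_start=strftime(problem_start, "%m/%d/%Y - %H:%M:%S")
```

Here the stats command keeps one row per src_host and Service, holding the start of the most recent run of PROBLEM events and the type of the last event seen. A row survives the where clause only if the last event is still a PROBLEM and that run started more than 900 seconds before the search ran. Because the 30-minute window overlaps between runs, the same open problem may be reported more than once, so the triggered script may need to tolerate repeats.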