Splunk Search

How to prevent an alert from triggering during a maintenance window

suhanishah
Loves-to-Learn Lots

Requirement: the alert should trigger only outside the maintenance window, and should stay silent during the window even if a server is down.

| tstats count where index=cts-dcpsa-app sourcetype=app:dcpsa host_ip IN (xx.xx.xxx.xxx, xx.xx.xxx.xxx) by host
| eval current_time=_time
| eval excluded_start_time=strptime("2024-04-14 21:00:00", "%Y-%m-%d %H:%M:%S")
| eval excluded_end_time=strptime("2024-04-15 04:00:00", "%Y-%m-%d %H:%M:%S")
| eval is_maintenance_window=if(current_time >= excluded_start_time AND current_time < excluded_end_time, 1, 0)
| eval is_server_down=if((host="xx.xx.xxx.xxx" AND count == 0) OR (host="xx.xx.xxx.xxx" AND count == 0) 1, 0 ) 

Trigger condition- |search is_maintenance window = 0 AND is_server_down=1

The alert is not triggering outside the maintenance window even though one of the servers is down. Please help me find what is wrong in the query, or suggest another possible solution.


richgalloway
SplunkTrust

I've had more consistent results by putting the trigger condition in the search and having the alert trigger if the number of results is not zero.

| tstats count where index=cts-dcpsa-app sourcetype=app:dcpsa host_ip IN (xx.xx.xxx.xxx, xx.xx.xxx.xxx) by host
| eval current_time=_time
| eval excluded_start_time=strptime("2024-04-14 21:00:00", "%Y-%m-%d %H:%M:%S")
| eval excluded_end_time=strptime("2024-04-15 04:00:00", "%Y-%m-%d %H:%M:%S")
| eval is_maintenance_window=if(current_time >= excluded_start_time AND current_time < excluded_end_time, 1, 0)
| eval is_server_down=if(count == 0, 1, 0) 
| where is_maintenance_window = 0 AND is_server_down=1
---
If this reply helps you, Karma would be appreciated.

marnall
Builder

@richgalloway has a good solution. The field name "is_maintenance window" in the condition has a typo (a space instead of an underscore; it should be is_maintenance_window), so watch for that.

Are either of you getting _time values when using "| eval current_time = _time" after tstats? There are no _time fields specified in the first tstats command. Perhaps it would work better with "| eval current_time = now()"
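
To check the point above, a minimal sketch that puts both time sources side by side. The field names time_from_event and time_from_clock are made up for illustration. Because the tstats command is split only by host and not by _time, the expectation is that time_from_event comes back empty while time_from_clock holds the epoch time at which the search ran:

```spl
| tstats count where index=cts-dcpsa-app sourcetype=app:dcpsa host_ip IN (xx.xx.xxx.xxx, xx.xx.xxx.xxx) by host
| eval time_from_event=_time
| eval time_from_clock=now()
| table host count time_from_event time_from_clock
```

If time_from_event is empty in the output, any comparison against the window start and end times built on it can never be true, which would explain an is_maintenance_window value that is always 0.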


suhanishah
Loves-to-Learn Lots

@marnall - @richgalloway's solution is creating false alerts: it triggers even when the servers are up. Can you provide another solution, such that the alert is not triggered inside the maintenance window even though servers are down, and is triggered outside the window only when at least one of the servers is down?


sjringo
Contributor

Try this:

| tstats count where index=cts-dcpsa-app sourcetype=app:dcpsa host_ip IN (xx.xx.xxx.xxx, xx.xx.xxx.xxx) by host
| eval current_time=tonumber(strftime(now(), "%H%M"))
| eval is_maintenance_window=if(current_time >= 2100 OR current_time < 400, 1, 0)
| eval is_server_down=if(count == 0, 1, 0)
| where is_maintenance_window = 0 AND is_server_down=1

suhanishah
Loves-to-Learn Lots

@sjringo - what should my trigger condition be?
Also, how will your query identify the date? I don't want the alert suppressed every day from 21:00 to 04:00. I want it suppressed only on a specific date, which is going to be April 23rd at 9 pm to April 24th at 4 am.

 


sjringo
Contributor

Your trigger condition is the same as it was before:

| where is_maintenance_window = 0 AND is_server_down=1

I'm assuming your maintenance window is on a specific day of the week?

April 23rd is a Tuesday. Is your maintenance window every Tuesday night / Wednesday morning?
If so, introduce a new attribute for the day of the week:

| tstats count where index=cts-dcpsa-app sourcetype=app:dcpsa host_ip IN (xx.xx.xxx.xxx, xx.xx.xxx.xxx) by host
| eval current_time=tonumber(strftime(now(), "%H%M"))
| eval aDayNumber = tonumber(strftime(now(), "%w"))
| eval is_maintenance_window=if((aDayNumber = 2 AND current_time >= 2100) OR (aDayNumber = 3 AND current_time < 400), 1, 0)
| eval is_server_down=if(count == 0, 1, 0)
| where is_maintenance_window = 0 AND is_server_down=1

 


suhanishah
Loves-to-Learn Lots

@sjringo - We don't have a fixed date, as it keeps changing, so I created two variables where I specify the date and time.
 

| eval excluded_start_time=strptime("2024-04-14 21:00:00", "%Y-%m-%d %H:%M:%S")
| eval excluded_end_time=strptime("2024-04-15 04:00:00", "%Y-%m-%d %H:%M:%S")
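
Since the window dates keep changing, one option is to keep them in a lookup file instead of editing the alert each time. The following is only a sketch under stated assumptions: maintenance_windows.csv is a hypothetical lookup you would have to create yourself, with hypothetical columns window_start and window_end holding timestamps in the same "%Y-%m-%d %H:%M:%S" format used above. The subsearch counts the windows that contain the current time and hands back 1 or 0, which becomes the value of is_maintenance_window:

```spl
| tstats count where index=cts-dcpsa-app sourcetype=app:dcpsa host_ip IN (xx.xx.xxx.xxx, xx.xx.xxx.xxx) by host
| eval is_maintenance_window=
    [| inputlookup maintenance_windows.csv
     | eval window_start_epoch=strptime(window_start, "%Y-%m-%d %H:%M:%S")
     | eval window_end_epoch=strptime(window_end, "%Y-%m-%d %H:%M:%S")
     | where now() >= window_start_epoch AND now() < window_end_epoch
     | stats count AS active_windows
     | eval in_window=if(active_windows > 0, 1, 0)
     | return $in_window]
```

With this shape, adding a new maintenance window means adding a row to the lookup, and the alert search itself stays unchanged.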

 


sjringo
Contributor

All I can say is: in the solution provided earlier, use now() instead of _time when evaluating whether to trigger or not.

Do you have any test data showing your attribute values, to help figure out why it is triggering falsely?

| eval current_time=now()
| eval excluded_start_time=strptime("2024-04-14 21:00:00", "%Y-%m-%d %H:%M:%S")
| eval excluded_end_time=strptime("2024-04-15 04:00:00", "%Y-%m-%d %H:%M:%S")
| eval is_maintenance_window=if(current_time >= excluded_start_time AND current_time < excluded_end_time, 1, 0)

suhanishah
Loves-to-Learn Lots

@sjringo - This is the result when the servers are taking traffic. I am going to test it tonight, when the servers go down, to see whether the alert triggers outside the window and does not trigger during the window. In both cases at least one server is down. Capture.PNG


sjringo
Contributor

When you're testing, keep in mind that this is the time from the log event:

| eval current_time=_time

while this is the current time, when the alert is running:

| eval current_time=now()

So, depending on your lookback period (earliest= latest=), you might pick up log events from before or after your outage window's start and end times.

But if you don't want any alerts during the outage window, now() should be the correct time to use in your trigger condition.


suhanishah
Loves-to-Learn Lots

I have created two queries : The below is for the correct outage window 

Capture1.PNG

The second one uses a random date, to see whether the alert is triggered when one of the servers goes down:

Capture2.PNG

Both have the same trigger condition set: | where is_maintenance_window=0 AND is_server_down=1


sjringo
Contributor

A couple of changes from your last image. Notice the change in how the time variables are evaluated: now() instead of _time, and strptime instead of strftime.
You could also remove the eval aHostMatch = ... code if you are already filtering the hosts in the initial tstats.

| tstats count where index=cts-dcpsa-app sourcetype=app:dcpsa host_ip IN (xx.xx.xxx.xxx, xx.xx.xxx.xxx) by host
| eval currTime = now() ```<- I was not getting a value when using _time with TSTATS ? ```
| eval excluded_start_time=strptime("2024-03-16 18:25:00", "%Y-%m-%d %H:%M:%S")
| eval excluded_stop_time=strptime("2024-03-16 18:30:00", "%Y-%m-%d %H:%M:%S")
| eval is_maintenance_window=if(currTime >= excluded_start_time AND currTime <= excluded_stop_time,1,0)
| eval aHostMatch = case(
match(host,"HOSTNAME1"),1, ```<- Case Sensitive```
match(host,"HOSTNAME2"),1, ```<- Case Sensitive```
true(),0)
```| where count == 0 AND is_maintenance_window == 1 AND aHostMatch ==1```
| table host count excluded_start_time, currTime, excluded_stop_time, is_maintenance_window, aHostMatch

Also, if a host is not reporting data (down), the initial query returns no row for that host, so there is nothing to match when you check where count == 0.
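
One way to make a silent host show up with a count of zero is to append a placeholder row for every host you expect, and then sum the counts per host. This is a sketch, not a tested alert: HOSTNAME1 and HOSTNAME2 are placeholders and must match the values of the host field exactly (including case), and the window dates are the ones from the original question.

```spl
| tstats count where index=cts-dcpsa-app sourcetype=app:dcpsa host_ip IN (xx.xx.xxx.xxx, xx.xx.xxx.xxx) by host
| append
    [| makeresults
     | eval host=split("HOSTNAME1,HOSTNAME2", ",")
     | mvexpand host
     | eval count=0
     | fields host count]
| stats sum(count) AS count by host
| eval current_time=now()
| eval excluded_start_time=strptime("2024-04-14 21:00:00", "%Y-%m-%d %H:%M:%S")
| eval excluded_end_time=strptime("2024-04-15 04:00:00", "%Y-%m-%d %H:%M:%S")
| eval is_maintenance_window=if(current_time >= excluded_start_time AND current_time < excluded_end_time, 1, 0)
| eval is_server_down=if(count == 0, 1, 0)
| where is_maintenance_window == 0 AND is_server_down == 1
```

A host that sent events keeps its real count, because the appended zero adds nothing to the sum. A host that sent nothing is left with only the placeholder row, so its count is 0 and is_server_down becomes 1. With the where clause inside the search, the alert can trigger when the number of results is greater than zero.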

TSTATS does not support multiple timeframes...

Another approach is to not use tstats and use stats count instead.
A first query (earliest=-30m@m latest=-15m@m) counts the historical entries, a second query (earliest=-14m@m latest=-1m@m) counts the current entries, and then you compare the historical and current counts by host:

index=cts-dcpsa-app (host=HOSTNAME1 OR host=HOSTNAME2) earliest=-30m@m latest=-15m@m
| stats count AS aHistCount by host
| appendcols
    [ search index=cts-dcpsa-app (host=HOSTNAME1 OR host=HOSTNAME2) earliest=-14m@m latest=-1m@m
    | stats count AS aCurrCount by host
    | table host, aCurrCount
    ]
| table host, aHistCount, aCurrCount
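
One caveat with appendcols is that it pairs rows by position, not by host, so if a host has no events in the current window, the remaining rows can line up against the wrong host. A possible variant, sketched under the assumption that a single search over the whole 30 minutes is acceptable, labels each event as historical or current and counts both per host in one pass. The period field is made up for illustration, and the current window here starts at -15m@m rather than -14m@m so that the two periods do not leave a gap:

```spl
index=cts-dcpsa-app (host=HOSTNAME1 OR host=HOSTNAME2) earliest=-30m@m latest=-1m@m
| eval period=if(_time >= relative_time(now(), "-15m@m"), "current", "historical")
| stats count(eval(period="historical")) AS aHistCount count(eval(period="current")) AS aCurrCount by host
| where aHistCount > 0 AND aCurrCount == 0
```

This returns only hosts that were reporting earlier and have since gone quiet. A host with no events in either period still produces no row at all, so it would not be caught by this comparison.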
