Hi Splunkers,
I am working on an alert which calculates the error rate (> 30%) and sends the alerts to PagerDuty via API.
index="test1" source="mylogs" NOT "TEST" earliest=-30m latest=now (":search-result match-indicator=\"PASS\" *******query*******
| stats count(id) AS OKStats
| appendcols [ search source=mylogs (":search-result match-indicator=\"ERROR\" *******query*******) NOT "TEST" earliest=-30m latest=now | stats count(id) AS ERRORStats]
| eval TotalTransactions = OKStats + ERRORStats
| eval ErrorRate = if(TotalTransactions > 0, round((ERRORStats / TotalTransactions) * 100, 2), 0)
| where ErrorRate >= 30
| eval dedup_key="HighErrorRate"
| table ErrorRate, dedup_key
Now, to clear the alert, I created another alert (< 30%):
index="serverlogs" source="mylogs" NOT "TEST" earliest=-30m latest=now (":search-result match-indicator=\"PASS\" ************myqueryparams************
| stats count(id) AS OKStats
| appendcols [ search source=mylogs (":search-result match-indicator=\"ERROR\" ************myqueryparams************) NOT "TEST" earliest=-30m latest=now | stats count(id) AS ERRORStats]
| bin _time span=5m
| eval ErrorPercent = if((OKStats + ERRORStats) > 0, round(ERRORStats / (OKStats + ERRORStats) * 100, 2), 0)
| sort -_time
| streamstats window=2 latest(ErrorPercent) as latest_percent, latest(_time) as latest_time, earliest(ErrorPercent) as earliest_percent
| where latest_time = _time AND latest_percent < 30 AND earliest_percent >= 30
| head 1
| eval dedup_key="HighErrorRate"
| table latest_percent, earliest_percent, dedup_key
For both alerts I set the trigger condition to number of results > 1, sending to PagerDuty, running every 30 min with throttling enabled. I do not see the second alert working or clearing the alert.
Any advice on how to achieve the clearing of alerts, meaning the alert should be resolved on PagerDuty? Creating a Python script is currently out of scope due to security reasons, hence I was trying to do it via a Splunk query.
Regards,
Amit
In order to prevent the second search from repeatedly firing when the threshold is not met, it needs some way of knowing about the previously triggered alert.
The way I would probably do this is to write to a lookup table as part of the first search, setting a flag that indicates the high-error alert has fired, e.g.:
| eval isAlerting=1
| outputlookup myalert_status
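For example, the tail end of your first (trigger) search might look something like this — just a sketch, keeping your field names and assuming the lookup is called myalert_status (the override_if_empty=false option is optional; it just stops a run with no results from wiping the flag):
``` ...your existing error-rate calculation... ```
| where ErrorRate >= 30
| eval dedup_key="HighErrorRate"
| eval isAlerting=1
``` record that the high-error alert has fired ```
| outputlookup override_if_empty=false myalert_status
| table ErrorRate, dedup_key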
Then, in your second search, check whether the alert has previously fired; if it has, clear it:
index="serverlogs" source="mylogs" NOT "TEST" earliest=-30m latest=now (":search-result match-indicator=\"PASS\" ************myqueryparams************
| stats count(id) AS OKStats
| appendcols [ search source=mylogs (":search-result match-indicator=\"ERROR\" ************myqueryparams************) NOT "TEST" earliest=-30m latest=now | stats count(id) AS ERRORStats]
| bin _time span=5m
| eval ErrorPercent = if((OKStats + ERRORStats) > 0, round(ERRORStats / (OKStats + ERRORStats) * 100, 2), 0)
| sort -_time
| streamstats window=2 latest(ErrorPercent) as latest_percent, latest(_time) as latest_time, earliest(ErrorPercent) as earliest_percent
| where latest_time = _time AND latest_percent < 30 AND earliest_percent >= 30
| head 1
| eval dedup_key="HighErrorRate"
| table latest_percent, earliest_percent, dedup_key
``` your search above ```
| appendcols [|inputlookup myalert_status]
| stats first(*) as *
| where isAlerting=1
``` Will not get past here if there is no previous alert fired ```
| eval isAlerting=0
| outputlookup myalert_status
``` Cleared the alert so it will not run multiple times ```
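One more thing: if the lookup has never been written yet, the | inputlookup myalert_status subsearch may complain that the lookup does not exist, so it can be worth seeding it once with a throwaway search along these lines:
| makeresults
| eval isAlerting=0
``` start in the cleared state ```
| table isAlerting
| outputlookup myalert_status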
Thanks livehybrid.
I have applied the changes and so far have not received any alert. I will wait for the threshold to be breached and see whether it clears the alert based on the lookup entry.
So far I have tested it manually by lowering the threshold, and it created a lookup entry with the alert status changing from 1 to 0.
Thanks,
Amit
Splunk doesn't remember past alerts; it simply fires based on the current query results, and if no results are returned, no alert is triggered.
Can you try a single scheduled alert instead, sending a "trigger" (error rate ≥ 30%) or a "resolve" (error rate < 30%) to PagerDuty with the same dedup_key?
| eval status=if(ErrorRate>=30,"triggered","resolved")
| eval dedup_key="HighErrorRate"
| table status, dedup_key, ErrorRate
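For example, on top of the OKStats / ERRORStats / TotalTransactions calculation you already have in your first search, the tail end could look roughly like this (just a sketch; how the status field is mapped to a PagerDuty trigger/resolve event depends on how your PagerDuty integration is configured):
``` ...your existing error-rate calculation... ```
| eval ErrorRate = if(TotalTransactions > 0, round((ERRORStats / TotalTransactions) * 100, 2), 0)
| eval status=if(ErrorRate>=30,"triggered","resolved")
``` no threshold filter here, so the search always returns one row ```
| eval dedup_key="HighErrorRate"
| table status, dedup_key, ErrorRate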
Regards,
Prewin
If this answer helped you, please consider marking it as the solution or giving a Karma. Thanks!
Thanks @PrewinThomas for looking into this and replying.
With this solution it will keep triggering the resolution alert in every 30-minute window (based on my alert condition), since the error rate is always below the threshold.
My requirement is as follows:
I need an alert to trigger as soon as the error rate crosses 30%, and if it is below 30% the existing alert on PagerDuty should be cleared; this check should run every 30 minutes.
I do not need any alert (PagerDuty notification) when it is below 30%, only the clearing of the existing alert.
Thanks,
Amit