I ran into a situation where I wanted to know why an alert was not fired for a resource. So I was looking for the field values (I don't know how to describe it better) that Splunk currently stores to suppress the alert action from being executed.
To make clearer what I mean, here is a short fictitious example:
Use Case: Monitoring of CPU usage of hosts. When the CPU usage hits the 80% threshold, fire an alert and throttle the alert for 1 hour, based on the host field.
Question: How can I determine for which hosts the alert is throttled?
Note: I'm interested in the throttling list the alert uses. Not in approaches that evaluate the CPU usage events.
Thank you in advance.
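You have a scheduled search that runs at specific times. Search the scheduler log for the time range in question, filling in your search head cluster hosts (or standalone search head name) and the name of the saved search:
index=_internal source=*scheduler.log host=<shc_hosts here or standalone_sh_name_here> savedsearch_name="*name of search here*"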
Fields to look at:
result_count: Were results found?
status: Did the search run or get skipped?
alert_actions: If yes, then it most likely worked, unless you had a large result set, in which case perhaps some results triggered the action and some were throttled.
suppression: Was one or more things throttled from your result set?
This will tell you whether the search found results, whether it executed an alert action, and whether something in the results was suppressed (aka throttled).
There are CSV suppression files in /opt/splunk/var/run/splunk/scheduler/suppression/
These are the files that are checked to determine whether something is throttled. The issue here is that most results in this data contain a hash of the fields you are suppressing on, so you cannot decode them. But you can see other information about the throttles:
"admin;SA-Utils;Audit - Script Errors;1c40161ea84755387fddbdfd9babb74e;",1594015219,ADD,0E4E58D12C9EC1325A555B91A142A336
Maybe this can help you out some?
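To make the sample suppression row above easier to read, here is a minimal Python sketch that splits it into labeled parts. The labels (owner, app, search_name, value_hash, digest), the helper name parse_suppression_row, and the reading of the second column as a Unix timestamp in seconds are all assumptions inferred from that one sample row, not documented Splunk names.

```python
import csv
from datetime import datetime, timezone
from io import StringIO

# Sample row copied verbatim from the answer above.
SAMPLE_ROW = (
    '"admin;SA-Utils;Audit - Script Errors;1c40161ea84755387fddbdfd9babb74e;",'
    '1594015219,ADD,0E4E58D12C9EC1325A555B91A142A336'
)


def parse_suppression_row(line):
    """Split one suppression CSV row into labeled parts.

    All labels are assumptions inferred from the sample row,
    not documented Splunk field names.
    """
    key, epoch, action, digest = next(csv.reader(StringIO(line)))
    # Assumed key layout: owner;app;saved search name;hash of throttled values;
    parts = key.rstrip(";").split(";")
    return {
        "owner": parts[0],
        "app": parts[1],
        # Re-join the middle in case the search name itself contains ';'.
        "search_name": ";".join(parts[2:-1]),
        "value_hash": parts[-1],
        # Assumption: the second column is a Unix timestamp in seconds.
        "time_utc": datetime.fromtimestamp(int(epoch), tz=timezone.utc),
        "action": action,
        "digest": digest,
    }


row = parse_suppression_row(SAMPLE_ROW)
print(row["search_name"], row["time_utc"].isoformat(), row["action"])
```

This only shows which saved search a row belongs to and when it was written. The hashed field values stay opaque, and whether the timestamp marks the start or the end of the throttle window is not clear from the sample alone.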
The real question
Thanks for your answer.
This will help when we want to check the current status of throttled values. But what I forgot to mention is that I also want to see the historic status of values that are no longer throttled.
My real use case was to understand why an alert fired for some, but not all, of the expected results, at a time when the throttling had already expired.
Your approach is very nice, but cumbersome to use in daily business, as there is no direct way to get information about the throttling status. (The values need to be hashed before they can be looked up in the throttling CSV files.)
Nevertheless I learned some new stuff. Thank you for that. 🙂
As a possible workaround for this information not being directly accessible in Splunk, I got a hint to use the Log Event alert action to write the desired information into an index.