Alerting

Trigger an alert when a value stays above a threshold for some time?

auzelevski
Explorer

Hello,
I have a query that displays some value over time in a chart, and I want to create an alert that triggers when this value stays over a threshold for more than 10 minutes straight.
How would I set up this alert?

1 Solution

ITWhisperer
SplunkTrust
| mstats p95(prometheus.container_memory_working_set_bytes) as p95_memory_bytes span=1m where pod=sf-mcdata--hydration-worker* AND stack=* by stack
| eval p95_memory_percent=100*p95_memory_bytes/(8*1024*1024*1024)
| stats first(p95_memory_percent) as first_p95_memory_percent by stack,_time
| eval threshold = 85
| eval aboveThreshold = if(first_p95_memory_percent > threshold, 1, 0)
| stats sum(aboveThreshold) as amountAboveThreshold by stack
| where amountAboveThreshold = 10

Then alert when the number of results is greater than 0.


ITWhisperer
SplunkTrust

Create a report that looks at the previous 10 minutes and checks the value for each of those 10 minutes, then count the number of minutes that the value exceeded the threshold. Based on this count, trigger your alert.

Schedule the report to run every minute.

Simples 😁
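
For example, a minimal sketch of that pattern, assuming a hypothetical metric your.metric in a hypothetical metrics index and a threshold of 85, scheduled every minute over the last 10 minutes:

| mstats avg(your.metric) as value span=1m where index=your_metrics_index
| eval aboveThreshold = if(value > 85, 1, 0)
| stats sum(aboveThreshold) as minutesAboveThreshold
| where minutesAboveThreshold = 10

This returns a row only when all 10 one-minute buckets exceeded the threshold, so the alert can simply trigger on "number of results greater than 0".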


auzelevski
Explorer

That sounds like a good idea. Can I send an alert via email if the count equals 10?
Or should I create a new search query that has a count column and create an alert that triggers when that value equals 10?


ITWhisperer
SplunkTrust

I would probably trigger when the count is 10 rather than writing another report.


auzelevski
Explorer

If I have the following modified query:

| mstats p95(prometheus.container_memory_working_set_bytes) as p95_memory_bytes span=1m where pod=sf-mcdata--hydration-worker* AND stack=* by stack
| eval p95_memory_percent=100*p95_memory_bytes/(8*1024*1024*1024)
| stats first(p95_memory_percent) as first_p95_memory_percent by stack,_time
| eval threshold = 85
| eval aboveThreshold = if(first_p95_memory_percent > threshold, 1, 0)
| stats sum(aboveThreshold) as amountAboveThreshold by stack

I would want to create an alert with the following trigger:

search amountAboveThreshold = 10

and this alert will run every minute over the last 10 minutes.
Did I get it right?


ITWhisperer
SplunkTrust

This will only work for the first row, i.e. the first stack. Is that what you intended?


auzelevski
Explorer

No, I want to alert if this condition is met in any of the stacks.
What do I need to modify for that to work?


ITWhisperer
SplunkTrust
| mstats p95(prometheus.container_memory_working_set_bytes) as p95_memory_bytes span=1m where pod=sf-mcdata--hydration-worker* AND stack=* by stack
| eval p95_memory_percent=100*p95_memory_bytes/(8*1024*1024*1024)
| stats first(p95_memory_percent) as first_p95_memory_percent by stack,_time
| eval threshold = 85
| eval aboveThreshold = if(first_p95_memory_percent > threshold, 1, 0)
| stats sum(aboveThreshold) as amountAboveThreshold by stack
| where amountAboveThreshold = 10

Then alert when the number of results is greater than 0.
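
As a rough sketch, the scheduled alert could look something like this in savedsearches.conf (the stanza name, cron, and trigger settings below are assumptions; in the UI this corresponds to a cron schedule of every minute, a search window of the last 10 minutes, and the trigger condition "number of results greater than 0"):

[Memory above threshold for 10 minutes]
# Hypothetical stanza - the search is the one above, repeated here for completeness
enableSched = 1
cron_schedule = * * * * *
dispatch.earliest_time = -10m@m
dispatch.latest_time = @m
counttype = number of events
relation = greater than
quantity = 0
alert.track = 1
search = | mstats p95(prometheus.container_memory_working_set_bytes) as p95_memory_bytes span=1m where pod=sf-mcdata--hydration-worker* AND stack=* by stack \
| eval p95_memory_percent=100*p95_memory_bytes/(8*1024*1024*1024) \
| stats first(p95_memory_percent) as first_p95_memory_percent by stack,_time \
| eval threshold = 85 \
| eval aboveThreshold = if(first_p95_memory_percent > threshold, 1, 0) \
| stats sum(aboveThreshold) as amountAboveThreshold by stack \
| where amountAboveThreshold = 10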

auzelevski
Explorer

OK, that looks good. One last question:

Would this alert for each stack? I want to include the stack it happened on in the alert message.
If it happened on two or more stacks, would this method only alert on one of them?


ITWhisperer
SplunkTrust

If you trigger for each result, you can use the fields from that result, i.e. you should be able to get an email (or whatever your trigger action is) for each stack with the problem.
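
For example, a minimal sketch of the per-result email settings in savedsearches.conf (the address and wording are placeholders; $result.stack$ is filled in from each triggering result row):

alert.digest_mode = 0
action.email = 1
action.email.to = oncall@example.com
action.email.subject = Memory above threshold on stack $result.stack$
action.email.message.alert = p95 memory stayed above the 85% threshold for 10 minutes on stack $result.stack$.

With alert.digest_mode = 0, the alert action fires once per result row, so each affected stack gets its own email.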

auzelevski
Explorer

Awesome, so that will work for me.

Now I have another query that looks like this:

| mstats rate_sum(mc_hydration_worker_total_message_duration_ms.sum) as metric_sum span=1h where index=e360_analytics_hydration_metrics host=sf-mcdata--hydration-worker* stack=* by stack
| appendcols [
  | mstats rate_sum(mc_hydration_worker_total_message_duration_ms.count) as metric_count span=1h where index=e360_analytics_hydration_metrics host=sf-mcdata--hydration-worker* stack=* by stack
]
| eval metric_rate=metric_sum/metric_count
| stats p95(metric_rate) as Hydration_Duration by stack
| where Hydration_Duration > 2500

 

Now I also want to alert if Hydration_Duration is greater than 2500 for any of the stacks, and alert about all of them.

So if I use this query and trigger for each result when the result count is greater than zero, this should work, right?


ITWhisperer
SplunkTrust

It is often said that appendcols is never the answer. There are exceptions, but this (probably) isn't one of them.

The way appendcols works, there is no intrinsic guarantee that the rows returned by the subsearch will be in the same order as the rows from the main search, so the values in the rows could misalign.

The reason I said "probably" is that you are using the same data and the same dimension in the by clause, so they probably will align.

Secondly, subsearches are limited to 50,000 events, which is probably not an issue in your case, but something to always bear in mind.

Rather than taking the risk, you could try it this way:

| mstats rate_sum(mc_hydration_worker_total_message_duration_ms.sum) as metric_sum rate_sum(mc_hydration_worker_total_message_duration_ms.count) as metric_count span=1h where index=e360_analytics_hydration_metrics host=sf-mcdata--hydration-worker* stack=* by stack
| eval metric_rate=metric_sum/metric_count
| stats p95(metric_rate) as Hydration_Duration by stack
| where Hydration_Duration > 2500

 


gcusello
SplunkTrust

Hi @auzelevski,

if you could share your search, it would be easier to help you.

Anyway, the general rule is:

<your_search>
| stats count BY key
| where count>threshold

Then you can configure your alert to trigger when the number of results is greater than zero.

Ciao.

Giuseppe

0 Karma

auzelevski
Explorer

Hi @gcusello, this is my query:

| mstats p95(prometheus.container_memory_working_set_bytes) as p95_memory_bytes span=1h where pod=sf-mcdata--hydration-worker* AND stack=* by stack,sp
| eval p95_memory_percent=100*p95_memory_bytes/(8*1024*1024*1024)
| chart first(p95_memory_percent) as test over _time by stack
| eval threshold=85

 
