Hello,
I have a query where I display a value over time in a chart, and I want to create an alert that triggers when this value stays above a threshold for more than 10 minutes straight.
How would I set up this alert?
Create a report that looks at the previous 10 minutes and checks the value for each of those minutes, then counts the number of minutes the value exceeded the threshold. Based on that count, trigger your alert.
Schedule the report to run every minute.
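As a rough sketch of that shape (placeholders rather than a tested search; 85 is just an example threshold):
| mstats avg(<your_metric>) as value span=1m where <your_filters> by stack
| eval aboveThreshold=if(value > 85, 1, 0)
| stats sum(aboveThreshold) as minutesAboveThreshold by stack
| where minutesAboveThreshold = 10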
Simples 😁
That sounds like a good idea. Can I send an alert via email if the count = 10?
Or should I create a new search query with a count column and create an alert that triggers when that value equals 10?
I would probably trigger when the count is 10 rather than writing another report.
If I have the following modified query:
| mstats p95(prometheus.container_memory_working_set_bytes) as p95_memory_bytes span=1m where pod=sf-mcdata--hydration-worker* AND stack=* by stack
| eval p95_memory_percent=100*p95_memory_bytes/(8*1024*1024*1024)
| stats first(p95_memory_percent) as first_p95_memory_percent by stack,_time
| eval threshold = 85
| eval aboveThreshold = if(first_p95_memory_percent > threshold,1,0)
| stats sum(aboveThreshold) as amountAboveThreshold by stack
I would want to create an alert with the following trigger:
search amountAboveThreshold = 10
and this alert will run every minute over the last 10 minutes.
Did I get it right?
This will only work for the first row, i.e. the first stack. Is that what you intended?
No I want to alert if this condition is met in any of the stacks.
What do I need to modify for it to work?
| mstats p95(prometheus.container_memory_working_set_bytes) as p95_memory_bytes span=1m where pod=sf-mcdata--hydration-worker* AND stack=* by stack
| eval p95_memory_percent=100*p95_memory_bytes/(8*1024*1024*1024)
| stats first(p95_memory_percent) as first_p95_memory_percent by stack,_time
| eval threshold = 85
| eval aboveThreshold = if(first_p95_memory_percent > threshold,1,0)
| stats sum(aboveThreshold) as amountAboveThreshold by stack
| where amountAboveThreshold = 10
then alert when number of results greater than 0
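If you set it up as a saved alert, the schedule and trigger translate roughly to something like this in savedsearches.conf (the stanza name is just an example):
[Memory over threshold for 10 minutes]
search = <the search above>
enableSched = 1
cron_schedule = * * * * *
dispatch.earliest_time = -10m@m
dispatch.latest_time = now
counttype = number of events
relation = greater than
quantity = 0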
Ok that looks good, one last question:
Would this alert for each stack? I want to include the stack it happened on in the alert message.
So if it happened on two or more stacks, would this method only alert on one of them?
If you trigger for each result, you can use the fields from that result, i.e. you should be able to get an email (or whatever your alert action is) for each stack with the problem.
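For example, with the trigger set to "For each result", the email action can pull fields from the triggering row via result tokens (the wording below is just an illustration):
Subject: Memory alert on stack $result.stack$
Message: p95 memory on $result.stack$ has been above the threshold for the last 10 minutes (count=$result.amountAboveThreshold$).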
Awesome, that will work for me.
Now I have another query that looks like this:
| mstats rate_sum(mc_hydration_worker_total_message_duration_ms.sum) as metric_sum span=1h where index=e360_analytics_hydration_metrics host=sf-mcdata--hydration-worker* stack=* by stack
| appendcols [
| mstats rate_sum(mc_hydration_worker_total_message_duration_ms.count) as metric_count span=1h where index=e360_analytics_hydration_metrics host=sf-mcdata--hydration-worker* stack=* by stack
]
| eval metric_rate=metric_sum/metric_count
| stats p95(metric_rate) as Hydration_Duration by stack
| where Hydration_Duration > 2500
Now I also want to alert if Hydration_Duration is greater than 2500 for any of the stacks, and alert about all of them.
So if I use this query and trigger for each result when the result count is greater than zero,
this should work, right?
It is often said that appendcols is never the answer; there are exceptions, but this (probably) isn't one of them.
The way appendcols works, there is no intrinsic guarantee that the rows returned by the subsearch will be in the same order as the main search, so the values in the rows could misalign.
The reason I said "probably" is that you are using the same data and the same dimension in the by clause, so they probably will align.
Secondly, subsearches are limited to 50,000 events, which is probably not an issue in your case, but something to always bear in mind.
Rather than taking the risk, you could try it this way:
| mstats rate_sum(mc_hydration_worker_total_message_duration_ms.sum) as metric_sum rate_sum(mc_hydration_worker_total_message_duration_ms.count) as metric_count span=1h where index=e360_analytics_hydration_metrics host=sf-mcdata--hydration-worker* stack=* by stack
| eval metric_rate=metric_sum/metric_count
| stats p95(metric_rate) as Hydration_Duration by stack
| where Hydration_Duration > 2500
Hi @auzelevski,
if you share your search, it will be easier to help you.
Anyway, the general rule is:
<your_search>
| stats count BY key
| where count>threshold
then you can configure your alert for results greater than zero.
Ciao.
Giuseppe
Hi @gcusello, this is my query:
| mstats p95(prometheus.container_memory_working_set_bytes) as p95_memory_bytes span=1h where pod=sf-mcdata--hydration-worker* AND stack=* by stack,sp
| eval p95_memory_percent=100*p95_memory_bytes/(8*1024*1024*1024)
| chart first(p95_memory_percent) as test over _time by stack
| eval threshold=85