Alerting

How to search a trending error count to alert when an application pool is more than 2 standard deviations from the normal?

daniel333
Builder

All,

I want to have an alert fire any time an application pool is more than, say, 2 standard deviations from normal. We have about 100 application pools.

I am guessing the logic would look something like this?

tag=java tag=problem
| stats count by app_pool
| where count > [some logic: 2 * stdev, using some Splunk command I don't know]

woodcock
Esteemed Legend

Try this:

tag=java tag=problem | bucket _time span=1h | stats count BY _time app_pool | eventstats stdev(count) AS stdev BY app_pool | where count > (2 * stdev)
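
A possible variant, assuming "normal" means each pool's own hourly average over the search window: the search above compares the hourly count with twice the standard deviation, not with the mean plus two standard deviations. The sketch below adds the per-pool mean. The field names avg_count, stdev_count, and upper_bound are illustrative, and the 1h span and the factor of 2 are assumptions to tune.

```
tag=java tag=problem
| bucket _time span=1h
| stats count BY _time app_pool
| eventstats avg(count) AS avg_count stdev(count) AS stdev_count BY app_pool
| eval upper_bound = avg_count + (2 * stdev_count)
| where count > upper_bound
```

For an alert, one would probably also keep only the most recent hourly bucket, so that old spikes inside the search window do not keep firing.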

tkwaller
Builder

Yes. Why I didn't get that, I'll never know. I tried bucketing, but apparently not together with eventstats.
Thanks for the help, as always.
Todd

woodcock
Esteemed Legend

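The problem was chart vs. stats: chart (like timechart) creates one column per app_pool, whereas stats keeps one row per _time and app_pool, which is what eventstats ... BY app_pool needs. Don't forget to click Accept.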

woodcock
Esteemed Legend

Try this:

tag=java tag=problem | stats count by app_pool | eventstats stdev(count) AS stdev | where count > (2 * stdev)

tkwaller
Builder

Any updates?

tkwaller
Builder

Well, this kind of works. When it is run, it gives a single stdev across ALL app_pools, but what we need is the stdev for EACH app_pool.

For example, this is the output using this search:
   app_pool  count  stdev
1  aaa       14576  10478.310567
2  abb       342    10478.310567
3  acc       45     10478.310567
4  add       1824   10478.310567

What we are trying to achieve is something like this:
   app_pool  count  stdev
1  aaa       14576  its stdev
2  abb       342    its stdev
3  acc       45     its stdev
4  add       1824   its stdev

Then we can alert on:
| where count > (2 * stdev)

I tried something like:
| eventstats stdev(count) AS stdev by app_pool

but that returns a stdev of 0 for all app_pools
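
A likely explanation, assuming the eventstats was appended to the stats count by app_pool search: after that stats there is exactly one row per app_pool, so each BY group holds a single value and has no spread to measure. One way to check is to count the rows in each group (rows is a hypothetical field name):

```
tag=java tag=problem
| stats count BY app_pool
| eventstats count AS rows stdev(count) AS stdev BY app_pool
```

If rows comes back as 1 for every app_pool, the stdev needs a series of counts per pool first, for example one count per hour.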

woodcock
Esteemed Legend

Back it up. To do a stdev, you need a series of numbers, so you have to have a count of something. Unless your raw data has counts (which clearly it does not, since you are using count instead of sum), we must do a count first; that is why I wrote it the way that I did. We could use timechart to generate a series of counts per app_pool, say hourly, from which we could then do a stdev per app_pool, but we MUST have a series of numbers FIRST, and only you can specify the necessary parameters. As an example, here is a solution for hourly timecharting:

 tag=java tag=problem | timechart span=1h count BY app_pool | eventstats stdev(count) AS stdev BY app_pool | where count > (2 * stdev)

tkwaller
Builder

I understand.
I did try this with the timechart command before posting here, but I couldn't get it to work. The one above does not work either; it returns 0 results. I'm guessing that it's not returning a stdev: when I removed the "| where count > (2 * stdev)" portion, it returned a count but no stdev:

_time             aaa   abb  acc  add
2016-04-28 09:00  5377  728  174  28790
2016-04-28 10:00  4303  584  29   18686

I confirmed this by only running:
tag=java tag=problem | timechart span=1h count by app_pool
and it returns the same results.

So it seems that it's counting properly but not calculating the stdev after counting.
This search:
tag=java tag=problem | timechart span=1h count by app_pool | eventstats stdev(count) AS stdev by app_pool

returns the same results as this search:
tag=java tag=problem | timechart span=1h count by app_pool
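
One way to read the result above: timechart ... by app_pool produces one column per pool (aaa, abb, and so on), so the output has no count or app_pool field and eventstats has nothing to compute on. A sketch of a possible fix, assuming the untable command is used to turn those columns back into rows. The avg_count and stdev_count names and the factor of 2 are illustrative, and limit=0 is added on the assumption that all of the roughly 100 pools should be kept instead of the default top series.

```
tag=java tag=problem
| timechart span=1h limit=0 count BY app_pool
| untable _time app_pool count
| eventstats avg(count) AS avg_count stdev(count) AS stdev_count BY app_pool
| where count > avg_count + (2 * stdev_count)
```

Unlike bucket followed by stats, timechart fills hours without events with a count of 0, which lowers the average for pools that are usually quiet. Whether that is desirable depends on what "normal" should mean for the alert.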
