Alerting

Need help creating a complex (at least to me) alert.

lyndac
Contributor

I have about 6 months' worth of data in a summary index: hourly file counts and volume for FTP servers. I am trying to generate an alert when any server's file count or volume deviates from its average (for that day of the week) by more than 25%.

So far I have the search below, plus a custom alert condition that checks whether the current count (or volume) is > the high or < the low. It works for a single server (there could be hundreds), but it always compares today's data against the Mondays in the past 30 days.

index=si-br server=foo earliest=-30d@d latest=-0d@d 
    | stats count as count sum(filesize) as volume by  server,location,_time 
    | bin _time span=1d 
    | stats sum(volume) as volume sum(count) as count by _time
    | where strftime(_time, "%w") == "1"
    | stats avg(count) as avgcount avg(volume) as avgvolume
    | appendcols [
        search index=br earliest=-0d@d latest=now server=foo  
          | stats count as count sum(filesize) as volume 
          | eval highcount=avgcount*1.25 
          | eval lowcount=avgcount*.75 
          | eval highvol=avgvolume*1.25 
          | eval lowvol=avgvolume*.75
      ]

My questions to you:

  1. How do I make the day of the week dynamic? (If today is Tuesday, I want to compare strftime(_time,"%w") == "2", but if it is Friday, I want strftime(_time,"%w") == "5".) I suppose I could make 7 clones of the alert, change the day of week in each, and schedule each to run once a week on the correct day. Is there another way?
  2. How can I exclude any "abnormal" counts/volumes from the calculation of the average? Say the 2nd Monday of the month had a 30% spike in volume that I do not want considered part of the average.
  3. How do I scale this out for potentially 100's of servers? Do I have to create a different alert for each server?
  4. Is it possible to use a lookup table to apply a different threshold for each server (25% for some, 50% for others...)?

aljohnson_splun
Splunk Employee

How do I make the day of the week dynamic? Is there another way?

Yes, you can use eventstats instead of stats, with a by clause on a weekday field that you extract using strftime, similar to your where clause. Also, I don't think you need the first stats before the bin.

index=si-br server=foo earliest=-30d@d latest=-0d@d 
| bin _time span=1d 
| eval weekday = strftime(_time, "%A") 
| stats sum(filesize) as volume, count as count by _time, weekday 
| eventstats avg(count) as avg_count avg(volume) as avg_volume by weekday 
| eval highcount = avg_count * 1.25 
| eval lowcount = avg_count * .75 
| eval highvol = avg_volume * 1.25 
| eval lowvol = avg_volume * .75 
| appendcols
    [ search index=br earliest=-0d@d latest=now server=foo
    | eval sub_weekday = strftime(_time, "%A") 
    | stats count as count sum(filesize) as volume by sub_weekday ]
| where sub_weekday == weekday AND ((volume > highvol) OR (count > highcount))

How can I exclude any "abnormal" counts/volumes from the calculation of the average?

There are lots of ways of doing this. The first and easiest way I can think of is using the outlier command or the anomalydetection command (6.3+).

| stats sum(filesize) as volume, count as count by _time, weekday
| anomalydetection volume count method=iqr action=transform

method=iqr is really a wrapper around the old outlier command, which would be used like this:

| stats sum(filesize) as volume, count as count by _time, weekday
| outlier volume count action=transform

The filtering is based on the inter-quartile range (IQR), which is computed from the difference between the 25th percentile and 75th percentile values of the numeric fields. If the value of a field in an event is less than (25th percentile) - param*IQR or greater than (75th percentile) + param*IQR, that field is transformed or that event is removed, based on the action parameter.
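
If you want to see (or tune) that filtering explicitly, here is a minimal sketch of the same IQR logic done by hand with eventstats and where. The 1.5 multiplier stands in for the param option and is only an assumption; the field names follow the searches above.

| stats sum(filesize) as volume, count as count by _time, weekday
| eventstats perc25(volume) as p25_vol perc75(volume) as p75_vol by weekday
| eval iqr_vol = p75_vol - p25_vol
| where volume >= p25_vol - 1.5 * iqr_vol AND volume <= p75_vol + 1.5 * iqr_vol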

If you wanted to completely remove those, rather than pulling them down to the 75th / 25th percentile, you can use action=remove for either command, anomalydetection or outlier.

How do I scale this out for potentially 100's of servers? Do I have to create a different alert for each server?

The by clause is your best friend. Let's imagine you have servers foo1 through foo300.

index=si-br earliest=-30d@d latest=-0d@d 
| bin _time span=1d 
| eval weekday = strftime(_time, "%A") 
| stats sum(filesize) as volume, count as count by _time, weekday, server
| eventstats avg(count) as avg_count avg(volume) as avg_volume by weekday, server
| eval highcount = avg_count * 1.25 
| eval lowcount = avg_count * .75 
| eval highvol = avg_volume * 1.25 
| eval lowvol = avg_volume * .75 
| appendcols
    [ search index=br earliest=-0d@d latest=now
    | eval sub_weekday = strftime(_time, "%A") 
    | stats count as count sum(filesize) as volume by sub_weekday, server ] 
| where sub_weekday == weekday AND ((volume > highvol) OR (count > highcount))

Is it possible to use a lookup table to apply a different threshold for each server (25% for some, 50% for others...)?

Yes, you definitely could, e.g.

| lookup thresholds.csv host as host OUTPUT type
| eval highcount = case(
    type == "sensitive", avg_count * 1.05,
    type == "not_sensitive", avg_count * 2)
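
For reference, the thresholds.csv behind that lookup might look something like this (the column names simply mirror the example above, and the rows are made up):

host,type
foo1,sensitive
foo2,not_sensitive
foo3,not_sensitive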

There are other threshold approaches that are more dynamic than a lookup table or the fixed percentage you have currently chosen. There is a really great showcase dashboard in the Machine Learning Toolkit and Showcase app called Detect Numeric Outliers, which demonstrates a few methods for outlier detection:

  • median absolute deviation
  • interquartile range
  • standard deviations

I’m guessing that one of these might be more appropriate.
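
For example, a rough sketch of the standard-deviation variant on this data (the 2-sigma cutoff is only an assumption) could look like:

index=si-br earliest=-30d@d latest=-0d@d 
| bin _time span=1d 
| eval weekday = strftime(_time, "%A") 
| stats sum(filesize) as volume by _time, weekday, server
| eventstats avg(volume) as avg_vol stdev(volume) as stdev_vol by weekday, server
| where volume > avg_vol + 2 * stdev_vol OR volume < avg_vol - 2 * stdev_vol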

sundareshr
Legend

How do I make the day of the week dynamic? (If today is Tuesday, I want to compare strftime(_time,"%w") == "2", but if it is Friday, I want strftime(_time,"%w") == "5".) I suppose I could make 7 clones of the alert, change the day of week and schedule to run once a week on the correct day. Is there another way?
- You could try strftime(_time, "%w") == strftime(now(), "%w")
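
Dropped into the search from the question, that might look roughly like this (a sketch only; the index and field names come from the original post):

index=si-br server=foo earliest=-30d@d latest=-0d@d 
| bin _time span=1d 
| stats sum(filesize) as volume count as count by _time 
| where strftime(_time, "%w") == strftime(now(), "%w") 
| stats avg(count) as avgcount avg(volume) as avgvolume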

How can I exclude any "abnormal" counts/volumes from the calculation of the average? Say the 2nd Monday of the month had a 30% spike in volume that I do not want considered part of the average.
- This may be your answer: http://docs.splunk.com/Documentation/Splunk/6.2.4/Search/Findingandremovingoutliers

How do I scale this out for potentially 100's of servers? Do I have to create a different alert for each server?
- Have you tried the "For Each" option under Alert Actions? When triggered, it executes the actions for each result, so if your result has multiple servers, it will email for each server.

Is it possible to use a lookup table to apply a different threshold for each server (25% for some, 50% for others...)?
- Yes, you can filter your data using benchmarks set from lookups.
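
Something along these lines could work, assuming a hypothetical server_thresholds.csv lookup with server and threshold_pct columns (e.g. 0.25 or 0.5):

| lookup server_thresholds.csv server OUTPUT threshold_pct
| eval highcount = avg_count * (1 + threshold_pct)
| eval highvol = avg_volume * (1 + threshold_pct)
| where count > highcount OR volume > highvol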
