Alerting

Standard deviation to alert us when we see sourcetypes and/or indexes grow more than expected

sbattista09
Contributor

Is there an easy way to create an alert that uses standard deviation to tell us when sourcetypes and/or indexes grow by more than a certain percentage, based on the license data in the _internal index?

I am thinking about the following search:

index=_internal source=*license_usage.log type="Usage"
| stats sum(b) as b by _time, pool, st
| eval b=round(b/1024/1024/1024, 2)
| timechart span=7d sum(b) by st useother=f

Then I would add last week's data for comparison and a filter such as " | where % > 20".

Goals for this search:
1. Quickly identify and alert when a data source is blowing up our licenses.
2. Quickly identify and alert when a data source is experiencing logging issues, such as a whole environment/sourcetype/index no longer sending logs, by reversing the logic.

_gkollias
Builder

My top two go-to's for this would be:

1) Use the Machine Learning Toolkit and let it do the work for you!
2) Try using the stats command and the perc function to compare current vs historical values. This is a timeless blog post that always comes in handy: https://www.splunk.com/blog/2016/01/29/writing-actionable-alerts.html

Here is an example:

index=_internal sourcetype=splunkd component=LicenseUsage type=Usage 
| bin span=5m _time
| stats sum(b) as bytes by _time, idx
| eval indexed_mb = ceiling(bytes/1048576) 
| stats sum(indexed_mb) as indexed_mb by _time, idx
| stats perc95(indexed_mb) as perc95_indexed_mb, latest(indexed_mb) as current_indexed_mb by idx
| where current_indexed_mb > perc95_indexed_mb

This way you aren't hardcoding values, and the search adjusts to environment changes on-the-fly. You can adjust the percentiles as you see fit, and even go more granular by breaking things out by index, sourcetype (whatever makes sense for your environment). Play around with this and let me know how it works for you.
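
Since the question asks about standard deviation specifically, here is a minimal sketch of the same idea using the avg and stdev functions instead of a percentile. The 30-day window, the daily buckets, the 2-sigma multiplier, and the field names (avg_bytes, stdev_bytes, latest_day, daily_gb, threshold_gb) are all assumptions to tune for your environment, not recommended values.

```
index=_internal source=*license_usage.log type="Usage" earliest=-30d@d latest=@d
| bin _time span=1d
| stats sum(b) as bytes by _time, idx
| eventstats max(_time) as latest_day
| eventstats avg(bytes) as avg_bytes, stdev(bytes) as stdev_bytes by idx
| where _time = latest_day AND bytes > avg_bytes + 2 * stdev_bytes
| eval daily_gb = round(bytes/1024/1024/1024, 2), threshold_gb = round((avg_bytes + 2 * stdev_bytes)/1024/1024/1024, 2)
| table idx daily_gb threshold_gb
```

Note that the baseline here includes the most recent day itself, which pulls the average up slightly. Swapping idx for st in the two by clauses would give the same check per sourcetype.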

sloshburch
Splunk Employee

Thanks for plugging the blog!


sbattista09
Contributor

Good idea! If I wanted a timechart showing whether a sourcetype is, let's say, 20% higher than it was last week, would MLTK be able to do that as well?


jacobpevans
Motivator

This query detects any host/sourcetype combinations that reported between 48 and 24 hours ago but not in the past 24 hours.

| tstats count where index=* earliest=-48h latest=-24h by host sourcetype | rename count as oldcount
| join type=left host sourcetype
    [ | tstats count where index=* earliest=-24h latest=now by host sourcetype | rename count as currentcount]
| where isnull(currentcount)
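
If the join subsearch becomes a bottleneck, a single-pass variant might look like the sketch below. It assumes that "stopped reporting" can be expressed as "latest event in the last 48 hours is older than 24 hours"; the field name lastTime is just illustrative.

```
| tstats max(_time) as lastTime where index=* earliest=-48h latest=now by host sourcetype
| where lastTime < relative_time(now(), "-24h")
| convert ctime(lastTime)
```
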
Cheers,
Jacob

If you feel this response answered your question, please do not forget to mark it as such. If it did not, but you do have the answer, feel free to answer your own post and accept that as the answer.

sbattista09
Contributor

Thanks for your input! It was helpful to get started.
I think I am getting close. If I can just take the search I made a few years ago, get it to work with sourcetypes, and alert on the change from volume_p1 with "where pct_diff > 20.0", we will be in working shape! Wish me luck!

index=_internal source=*license_usage.log type="Usage" 
| eval h=if(len(h)=0 OR isnull(h),"(SQUASHED)",h) 
| eval s=if(len(s)=0 OR isnull(s),"(SQUASHED)",s) 
| eval idx=if(len(idx)=0 OR isnull(idx),"(UNKNOWN)",idx) 
| stats sum(b) as b by _time, pool, s, h, idx 
| search pool="Splunk Production" 
| timechart span=1h sum(b) AS volume 
| eval volume=round(volume/1024/1024/1024, 2)
| autoregress volume
| eval pct_diff=round(100*(volume-volume_p1)/volume_p1, 2)
| where pct_diff > 20.0
| reverse
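
For the per-sourcetype, week-over-week comparison discussed earlier in the thread, a rough sketch could look like the following. The 14-day window, the week labels (this_week, last_week), and the 20% threshold are assumptions for illustration only.

```
index=_internal source=*license_usage.log type="Usage" earliest=-14d@d latest=@d
| eval week=if(_time >= relative_time(now(), "-7d@d"), "this_week", "last_week")
| chart sum(b) over st by week
| eval pct_diff=round(100 * (this_week - last_week) / last_week, 2)
| where pct_diff > 20
```

Sourcetypes with no data last week end up with a null pct_diff and drop out of the results. For the second goal (sources that went quiet), adding "| fillnull value=0 this_week" before the eval and flipping the filter to "pct_diff < -20" should surface sourcetypes that shrank or stopped entirely.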

jacobpevans
Motivator

We call this one "hosts with abnormal metadata". It includes any host that last reported between 24 and 48 hours ago, meaning it has sent nothing in the past 24 hours. It also detects hosts that first started reporting in the past 24 hours. I realize this does not fully answer your question, which is why this is a comment instead of an answer.

| metadata type=hosts
| eval daysSinceFirstEvent = round((now() - firstTime)/86400, 2)
| eval  daysSinceLastEvent = round((now() - lastTime )/86400, 2)
| eval hoursSinceLastEvent = round((now() - lastTime )/3600 , 2)
| sort firstTime
| convert ctime(firstTime) as firstTime
| convert ctime(lastTime)  as lastTime
| search daysSinceFirstEvent < 1 OR (hoursSinceLastEvent>24 AND hoursSinceLastEvent<48)
| table host firstTime daysSinceFirstEvent lastTime daysSinceLastEvent hoursSinceLastEvent
Cheers,
Jacob
