Is there an easy way to create an alert that uses standard deviation to alert us when sourcetypes and/or indexes grow more than a certain percentage, based on the license usage data in the _internal index?
I am thinking about the following search:
index=_internal source=*license_usage.log type="Usage" | stats sum(b) as b by _time, pool, st | eval b=round(b/1024/1024/1024, 2) | timechart span=7d sum(b) by st useother=f
then add the last week's data and a "| where % > 20"
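A rough sketch of that week-over-week comparison (the field names b and st come from license_usage.log; the 20% threshold is illustrative):

index=_internal source=*license_usage.log type="Usage" earliest=-14d | eval period=if(_time >= relative_time(now(), "-7d"), "current", "previous") | stats sum(b) as bytes by st, period | eval gb=round(bytes/1024/1024/1024, 2) | chart latest(gb) over st by period | eval pct_diff=round(100.0*(current-previous)/previous, 2) | where pct_diff > 20.0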
Goals for this search:
1. Quickly identify and alert when a data source is blowing up our license usage.
2. Quickly identify and alert when a data source is experiencing logging issues, such as a whole environment/sourcetype/index no longer sending logs, by reversing the logic.
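As a starting point for the standard-deviation idea, something like this might work (the 2-sigma threshold and 30-day baseline are assumptions, adjust to taste):

index=_internal source=*license_usage.log type="Usage" earliest=-30d | bin span=1d _time | stats sum(b) as bytes by _time, st | eventstats avg(bytes) as avg_bytes, stdev(bytes) as stdev_bytes by st | where _time >= relative_time(now(), "-1d@d") AND bytes > avg_bytes + 2 * stdev_bytes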
My top two go-to's for this would be:
1) Use the Machine Learning Toolkit and let it do the work for you!
2) Try using the stats command and the perc function to compare current vs. historical values. This is a timeless blog post that always comes in handy: https://www.splunk.com/blog/2016/01/29/writing-actionable-alerts.html
Here is an example:
index=_internal sourcetype=splunkd component=LicenseUsage type=Usage | bin span=5m _time | stats sum(b) as bytes by _time, idx | eval indexed_mb=ceiling(bytes/1048576) | stats perc95(indexed_mb) as perc95_indexed_mb, latest(indexed_mb) as current_indexed_mb by idx | where current_indexed_mb > perc95_indexed_mb
This way you aren't hardcoding values, and the search adjusts to environment changes on the fly. You can adjust the percentiles as you see fit, and even go more granular by breaking things out by index, sourcetype, or whatever makes sense for your environment. Play around with this and let me know how it works for you.
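For example, here is a more granular variant of the search above, broken out by both index and sourcetype (a sketch, same caveats as above):

index=_internal sourcetype=splunkd component=LicenseUsage type=Usage | bin span=5m _time | stats sum(b) as bytes by _time, idx, st | eval indexed_mb=ceiling(bytes/1048576) | stats perc95(indexed_mb) as perc95_indexed_mb, latest(indexed_mb) as current_indexed_mb by idx, st | where current_indexed_mb > perc95_indexed_mb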
This query detects any host/sourcetype combinations that reported in the prior 24-hour window (48 to 24 hours ago) but not in the most recent 24 hours.
| tstats count where index=* earliest=-48h latest=-24h by host sourcetype | rename count as oldcount | join type=left host sourcetype [ | tstats count where index=* earliest=-24h latest=now by host sourcetype | rename count as currentcount] | where isnull(currentcount)
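If you run into join's subsearch limits in a large environment, the same comparison can be done in a single tstats pass (a sketch, untested):

| tstats count where index=* earliest=-48h latest=now by _time span=1h, host, sourcetype | stats sum(eval(if(_time < relative_time(now(), "-24h"), count, 0))) as oldcount, sum(eval(if(_time >= relative_time(now(), "-24h"), count, 0))) as currentcount by host, sourcetype | where oldcount > 0 AND currentcount = 0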
Thanks for your input! It was helpful to get started.
I am thinking I am getting close. If I can just take the search I made a few years ago, get it to work with the sourcetypes, and alert off volume_p1 when pct_diff > 20.0, we will be in working shape! Wish me luck!
index=_internal source=*license_usage.log type="Usage" | eval h=if(len(h)=0 OR isnull(h),"(SQUASHED)",h) | eval s=if(len(s)=0 OR isnull(s),"(SQUASHED)",s) | eval idx=if(len(idx)=0 OR isnull(idx),"(UNKNOWN)",idx) | stats sum(b) as b by _time, pool, s, h, idx | search pool="Splunk Production" | timechart span=1h sum(b) AS volume | eval volume=round(volume/1024/1024/1024, 2) | reverse | autoregress volume | eval pct_diff=round(100.0*(volume-volume_p1)/volume_p1, 2) | where pct_diff > 20.0
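One caveat: autoregress has no by clause, so to do this per sourcetype you would likely want streamstats instead. A sketch of that variant (the hourly span and 20% threshold carried over from above):

index=_internal source=*license_usage.log type="Usage" | eval s=if(len(s)=0 OR isnull(s),"(SQUASHED)",s) | bin span=1h _time | stats sum(b) as bytes by _time, s | eval volume=round(bytes/1024/1024/1024, 2) | sort 0 s, _time | streamstats current=f window=1 last(volume) as volume_p1 by s | eval pct_diff=round(100.0*(volume-volume_p1)/volume_p1, 2) | where pct_diff > 20.0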
We call this one "hosts with abnormal metadata". It includes any host that reported in the prior 24-hour window (48 to 24 hours ago) but not in the most recent 24 hours. It also detects hosts that just started reporting in the past 24 hours. I realize this does not fully answer your question, which is why this is a comment instead of an answer.
| metadata type=hosts | eval daysSinceFirstEvent = round((now() - firstTime)/86400, 2) | eval daysSinceLastEvent = round((now() - lastTime )/86400, 2) | eval hoursSinceLastEvent = round((now() - lastTime )/3600 , 2) | sort firstTime | convert ctime(firstTime) as firstTime | convert ctime(lastTime) as lastTime | search daysSinceFirstEvent < 1 OR (hoursSinceLastEvent>24 AND hoursSinceLastEvent<48) | table host firstTime daysSinceFirstEvent lastTime daysSinceLastEvent hoursSinceLastEvent