I have seen several similar questions asked, but they are often answered in different ways so I'm hoping whoever answers this can explain why they created the search string the way they did.
I have multiple hosts and I want to create an alert if the count of events reduces by more than 2 standard deviations on a per hour basis for the last four weeks for each host. I have seen many examples that used buckets while others used timechart. My understanding is that bucket will not include entries that have a count of 0 so timechart should be used, is this correct?
| timechart span=1h count by host
| stats stdev(count) AS Stdev
| eval thresh=Stdev*2
| where count < thresh
| table host count
timechart does solve if an interval has 0 count, but in your query, once timechart command is applied, results reduce to _time, host1,host2 ....etc fields. So, stdev(count) doesnt yield any results, because there is no field with name count.
untable converts timechart/chart type of data to stats type format
| timechart span=1h count by host
| untable _time,host,count
| eventstats stdev(count) as std by host
| eval threshold=2*std
| where count>threshold
The general rule is this - start with the detail that you have. Group it at the level you want.
For patching holes that are found to be empty, you can use
cont=t, or you can use
stats and add records for the empty slots with, for example,
appendpipe. This technique works just as well when it is only the missing items that you care about,as might be the case here.
You have two different potential standard deviations, the ones for non-zero hours, and the ones for all hours. They have different values, and different meanings. The average and standard deviation for non-zero hours are obviously going to give you a higher threshold for a "high" number than the average and standard deviation for all hours. Since you are concentrating on low numbers, you probably want the
stdev() which include the zeroes.
IMPORTANT NOTE - Standard deviations are generally meaningless unless you also have the median or average as well.
So, here's the
index=is1 | fields _time host | eval _time = 3600*floor(_time/3600) | stats count as hourcount by host _time | rename COMMENT as "add a zero record for each combination of _time and host, then stats again" | appendpipe [| stats values(_time) as Times values(host) as host | mvexpand host | mvexpand Times | rename Times as _time | eval hourcount = 0 ] | stats sum(hourcount) as hourcount by host _time | rename COMMENT as "calculate the overall stdev and average" | eventstats avg(hourcount) as avghourcount stdev(hourcount) as stdevhourcount by host | where hourcount < avghourcount - 2*stdevhourcount
_time is equivalent to
| bin _time span=1h, except that coded as above that part of the search is streaming and distributable. I believe that with these specific options,
bin would be streaming and distributable as well, but I haven't tested it.
2) Everything above should be equivalent to this...
index=is1 | timechart span=1h cont=t count as hourcount by host | eventstats avg(hourcount) as avghourcount stdev(hourcount) as stdevhourcount by host | where hourcount < avghourcount - 2*stdevhourcount
... which certainly looks a lot cleaner, but I'm never confident that
timechart isn't going to pull some shinanigans. Try them against each other, and if they give the exact same results, then use this one. If not, tell use the difference and we'll figure out what
timechart is playing at.
Updated to note that
bin with these options is probably streaming and distributable.