<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to calculate statistical outliers over a 90d@h sliding window? in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/How-to-calculate-statistical-outliers-over-a-90d-h-sliding/m-p/596257#M207565</link>
    <description>&lt;P&gt;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/6367"&gt;@bowesmana&lt;/a&gt;&lt;/P&gt;&lt;P&gt;Thank you for the reply and suggestions.&lt;/P&gt;&lt;P&gt;We will investigate using&amp;nbsp;&lt;SPAN class=""&gt;accelerated_fields.&amp;nbsp; TY!&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;We currently store a lot of statistical info in the "Foo_Count-upperBound" lookup, including "mean" and "stdev".&lt;/P&gt;&lt;P&gt;RE:&amp;nbsp;&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;SPAN&gt;I would have also stored the average + stdev in the lookup for the host, so that the 2 * variance can be done in the outlier detection rather than baking the 2* factor into the lookup table&lt;/SPAN&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Are you suggesting this to improve performance, or for tuning the outlier threshold?&lt;/P&gt;&lt;P&gt;I think I know what you mean; it sort of cleans things up more.&amp;nbsp; TY!&lt;/P&gt;</description>
    <pubDate>Tue, 03 May 2022 13:37:01 GMT</pubDate>
    <dc:creator>Glasses</dc:creator>
    <dc:date>2022-05-03T13:37:01Z</dc:date>
    <item>
      <title>How to calculate statistical outliers over a 90d@h sliding window?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-calculate-statistical-outliers-over-a-90d-h-sliding/m-p/596193#M207542</link>
      <description>&lt;P&gt;Scenario:&lt;BR /&gt;We have a data source of interest that we wish to analyze.&lt;BR /&gt;The data source is hourly host activity events.&lt;BR /&gt;An endpoint agent installed on a user's host monitors for specific events.&lt;BR /&gt;The endpoint agent reports these events to the central server, aka manager/collector.&lt;BR /&gt;Then the central server sends the data/events to Splunk for ingest.&lt;BR /&gt;We found that a distinct count of specific action events per hour per host is very interesting to us.&lt;BR /&gt;If the hourly count per user is greater than the "normal behavior" average, we want to be alerted.&lt;BR /&gt;We define normal behavior as the "90-day average of distinct hourly counts per host/user".&lt;BR /&gt;We define an outlier/alert as an hourly distinct count more than 2 standard deviations above the 90-day hourly average.&lt;BR /&gt;For instance, if the 90-day hourly average is 2 events for a host, then 10 events in a single hour for that host would fire an alert.&lt;/P&gt;&lt;P&gt;We tried many different methods and found some anomalies.&lt;BR /&gt;One issue is the events' arrival time to Splunk.&lt;BR /&gt;Specifically, the data does not always arrive in Splunk at a consistent interval.&lt;BR /&gt;The endpoint agent may be delayed in processing or sending the data to the central server if the network connection is lost or the running host was suspended/shut down shortly after the events of interest occurred.&amp;nbsp;&amp;nbsp;We have accepted this issue as it is very infrequent.&lt;/P&gt;&lt;P&gt;Methodology:&lt;BR /&gt;In order to conduct our analysis we have multiple phases.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Phase 1 &amp;gt; prepare the data and output to KVstore lookup&lt;/STRONG&gt;&lt;BR /&gt;We run a query to prime the historic data.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;index=foo earliest=-90d@h latest=-1h@h foo_event=* host=*
| timechart span=1h dc(foo_event) as Foo_Count by host limit=0 
| untable _time host Foo_Count 
| outputlookup 90d-Foo_Count&lt;/LI-CODE&gt;&lt;P&gt;Then we modify and save the query to append the new data; &lt;STRONG&gt;we use -2h@h and -1h@h to mitigate lagging events.&lt;/STRONG&gt;&amp;nbsp;&amp;nbsp;This report runs first every hour at &lt;STRONG&gt;minute=0.&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;index=foo earliest=-2h@h latest=-1h@h foo_event=* host=*
| timechart span=1h dc(foo_event) as Foo_Count by host limit=0 
| untable _time host Foo_Count 
| outputlookup 90d-Foo_Count append=t&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;Phase 2 &amp;gt; calculate the upperBound for each host&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;This report runs second every hour at &lt;STRONG&gt;minute=15.&amp;nbsp; &lt;/STRONG&gt;We add additional statistics for investigation purposes.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| inputlookup 90d-Foo_Count 
| timechart span=1h values(Foo_Count) as Foo_Count by host limit=0 
| untable _time host Foo_Count 
| stats min(Foo_Count) as Mini max(Foo_Count) as Maxi mean(Foo_Count) as Averg stdev(Foo_Count) as sdev median(Foo_Count) as Med mode(Foo_Count) as Mod range(Foo_Count) as Rng by host
| eval upperBound=(Averg+sdev*exact(2)) 
| outputlookup Foo_Count-upperBound&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;Phase 3 &amp;gt; trim the oldest data to maintain a 90d@h interval&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;This report runs third every hour at &lt;STRONG&gt;minute=30.&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| inputlookup 90d-Foo_Count 
| eval trim_time = relative_time(now(),"-90d@h") 
| where _time&amp;gt;trim_time 
| fields - trim_time 
| outputlookup 90d-Foo_Count&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;Phase 4 &amp;gt; detect outliers&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;This alert runs fourth (last) every hour at &lt;STRONG&gt;minute=45&lt;/STRONG&gt;.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;index=foo earliest=-1h@h latest=@h foo_event=* host=* 
| stats dc(foo_event) as Foo_Count by host 
| lookup Foo_Count-upperBound host output upperBound 
| eval isOutlier=if('Foo_Count' &amp;gt; upperBound, 1, 0)&lt;/LI-CODE&gt;&lt;P&gt;This method is successful at alerting on outliers.&lt;BR /&gt;RE: event lag, we monitor and keep track of how significant it is.&lt;/P&gt;&lt;P&gt;Originally, we tried using the MLTK with a DensityFunction and partial fit; however, we have approximately 65 million data points, which causes issues with the Smart Outlier Detection assistant.&lt;/P&gt;&lt;P&gt;The question is whether anyone has a different or more efficient way to do this?&lt;/P&gt;&lt;P&gt;Thank you for your time!&lt;/P&gt;</description>
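The statistical rule the four phases implement (flag an hourly distinct count above mean + 2 * stdev of the 90-day history) can be sketched outside Splunk. The following is a minimal Python illustration with hypothetical sample counts, not the actual Foo_Count data:

```python
from statistics import mean, stdev

def upper_bound(history, k=2.0):
    """Upper bound = mean + k * stdev over the historical hourly counts."""
    return mean(history) + k * stdev(history)

def is_outlier(current_count, history, k=2.0):
    """Flag the current hour's count if it exceeds the upper bound."""
    return current_count > upper_bound(history, k)

# Hypothetical 90-day history of hourly distinct counts for one host
history = [2, 1, 3, 2, 2, 1, 2, 3, 2, 2]

print(is_outlier(10, history))  # True: well above mean + 2*stdev
print(is_outlier(2, history))   # False: at the mean
```

This mirrors the Phase 2 `eval upperBound=(Averg+sdev*exact(2))` and the Phase 4 `if('Foo_Count' > upperBound, 1, 0)` comparison.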
      <pubDate>Tue, 03 May 2022 13:53:33 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-calculate-statistical-outliers-over-a-90d-h-sliding/m-p/596193#M207542</guid>
      <dc:creator>Glasses</dc:creator>
      <dc:date>2022-05-03T13:53:33Z</dc:date>
    </item>
    <item>
      <title>Re: How to calculate statistical outliers over a 90d@h sliding window?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-calculate-statistical-outliers-over-a-90d-h-sliding/m-p/596206#M207546</link>
      <description>&lt;P&gt;Nice description of your process.&amp;nbsp;&lt;/P&gt;&lt;P&gt;By storing to the KV store, you've solved a key issue: the performance cost of calculating the rolling average when you have lots of data points.&lt;/P&gt;&lt;P&gt;Unless you can fix the lag issue, you have to deal with it as you have done, with the 1h delay.&lt;/P&gt;&lt;P&gt;I have had situations in the past where a single average over the 90 days is not useful because different days/times of day are significant, so I've used time bins for the lookup.&lt;/P&gt;&lt;P&gt;Are you using&amp;nbsp;&lt;SPAN class=""&gt;accelerated_fields on host in your KV store? If the 65m data points are in the KV store, it may improve lookup times, or at least reduce the impact on the host through the additional index.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN class=""&gt;I would have also stored the average + stdev in the lookup for the host, so that the 2 * variance can be done in the outlier detection rather than baking the 2* factor into the lookup table, but that may not be useful in your use case.&lt;/SPAN&gt;&lt;/P&gt;</description>
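The time-bin idea above, keeping a baseline per (day-of-week, hour) bin per host instead of one 90-day average per host, can be sketched as follows. This is a hypothetical illustration, not the poster's actual lookup schema:

```python
from collections import defaultdict
from statistics import mean
from datetime import datetime

# Hypothetical history: (timestamp, host, hourly distinct count) rows
history = [
    (datetime(2022, 4, 4, 9),  "hostA", 5),  # Monday 09:00
    (datetime(2022, 4, 11, 9), "hostA", 7),  # Monday 09:00
    (datetime(2022, 4, 5, 2),  "hostA", 1),  # Tuesday 02:00
    (datetime(2022, 4, 12, 2), "hostA", 1),  # Tuesday 02:00
]

# Group counts per (host, weekday, hour) bin rather than one global average
bins = defaultdict(list)
for ts, host, count in history:
    bins[(host, ts.weekday(), ts.hour)].append(count)

# One baseline per bin; busy Monday mornings no longer skew quiet nights
baseline = {key: mean(vals) for key, vals in bins.items()}
print(baseline[("hostA", 0, 9)])  # Monday-09:00 baseline: 6
```

In SPL terms this corresponds to adding day-of-week/hour-of-day fields to the `by` clause of the baseline search and to the lookup key.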
      <pubDate>Tue, 03 May 2022 01:49:11 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-calculate-statistical-outliers-over-a-90d-h-sliding/m-p/596206#M207546</guid>
      <dc:creator>bowesmana</dc:creator>
      <dc:date>2022-05-03T01:49:11Z</dc:date>
    </item>
    <item>
      <title>Re: How to calculate statistical outliers over a 90d@h sliding window?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-calculate-statistical-outliers-over-a-90d-h-sliding/m-p/596257#M207565</link>
      <description>&lt;P&gt;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/6367"&gt;@bowesmana&lt;/a&gt;&lt;/P&gt;&lt;P&gt;Thank you for the reply and suggestions.&lt;/P&gt;&lt;P&gt;We will investigate using&amp;nbsp;&lt;SPAN class=""&gt;accelerated_fields.&amp;nbsp; TY!&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;We currently store a lot of statistical info in the "Foo_Count-upperBound" lookup, including "mean" and "stdev".&lt;/P&gt;&lt;P&gt;RE:&amp;nbsp;&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;SPAN&gt;I would have also stored the average + stdev in the lookup for the host, so that the 2 * variance can be done in the outlier detection rather than baking the 2* factor into the lookup table&lt;/SPAN&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Are you suggesting this to improve performance, or for tuning the outlier threshold?&lt;/P&gt;&lt;P&gt;I think I know what you mean; it sort of cleans things up more.&amp;nbsp; TY!&lt;/P&gt;</description>
      <pubDate>Tue, 03 May 2022 13:37:01 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-calculate-statistical-outliers-over-a-90d-h-sliding/m-p/596257#M207565</guid>
      <dc:creator>Glasses</dc:creator>
      <dc:date>2022-05-03T13:37:01Z</dc:date>
    </item>
    <item>
      <title>Re: How to calculate statistical outliers over a 90d@h sliding window?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-calculate-statistical-outliers-over-a-90d-h-sliding/m-p/596313#M207584</link>
      <description>&lt;P&gt;My thoughts on storing the base stdev are about tuning the outlier threshold. I can choose my outlier range in the dashboard, which I may want to be 1.8 or 2.2 or... In practice, if you know that upperBound is stored as 2*stdev, then you can always recalculate anything from that anyway.&lt;/P&gt;&lt;P&gt;But that's all about use case. There's unlikely to be any significant performance impact in delaying the calculation, but I&amp;nbsp;tend to work more with dashboards for triage than predefined reports, hence my bias.&lt;/P&gt;&lt;P&gt;Accelerated fields are documented here:&lt;/P&gt;&lt;P&gt;&lt;A href="https://dev.splunk.com/enterprise/docs/developapps/manageknowledge/kvstore/usingconfigurationfiles/#Accelerate-fields" target="_blank"&gt;https://dev.splunk.com/enterprise/docs/developapps/manageknowledge/kvstore/usingconfigurationfiles/#Accelerate-fields&lt;/A&gt;&lt;/P&gt;&lt;P&gt;and they can make a significant difference to lookup performance.&lt;/P&gt;</description>
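The point about storing the base mean and stdev rather than a pre-baked 2*stdev bound can be illustrated with a short sketch: the same stored statistics support any multiplier chosen at dashboard time. The values here are hypothetical lookup-row contents; in Splunk this would be an `eval` against the lookup fields:

```python
def threshold(avg, sd, k):
    """Recompute the outlier threshold from stored base statistics."""
    return avg + k * sd

# Hypothetical stored per-host statistics (one lookup row)
avg, sd = 2.0, 0.7

# The same stored row supports any k chosen in the dashboard
for k in (1.8, 2.0, 2.2):
    print(k, threshold(avg, sd, k))
```

Baking `2*stdev` into the lookup fixes k at write time; storing the base statistics defers that choice to the detection search.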
      <pubDate>Tue, 03 May 2022 22:10:02 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-calculate-statistical-outliers-over-a-90d-h-sliding/m-p/596313#M207584</guid>
      <dc:creator>bowesmana</dc:creator>
      <dc:date>2022-05-03T22:10:02Z</dc:date>
    </item>
    <item>
      <title>Re: How to calculate statistical outliers over a 90d@h sliding window?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-calculate-statistical-outliers-over-a-90d-h-sliding/m-p/596317#M207585</link>
      <description>&lt;P&gt;Thank you for the reply; that makes sense.&lt;/P&gt;</description>
      <pubDate>Tue, 03 May 2022 22:22:13 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-calculate-statistical-outliers-over-a-90d-h-sliding/m-p/596317#M207585</guid>
      <dc:creator>Glasses</dc:creator>
      <dc:date>2022-05-03T22:22:13Z</dc:date>
    </item>
  </channel>
</rss>

