I know the predict command becomes more accurate when you feed it more data, but I don't want a dashboard to query 2 months' worth of data and take something like 2 minutes to load. Is there a way to get a more accurate prediction without actively querying the past 2 months? Or is there a way to do this with a different function? FYI, I do not have the authority to download the MLTK.
I know this is a tough question but would like to hear some ideas.
index=summary source="summary_events_2"
orig_source=/var/log/pnr*
ms_region=us-west-1
ms_level=E*
| timechart span=15m sum(count) as count
| predict count as count_prediction period=7 algorithm=LLP5 future_timespan=10 holdback=0 upper50=high_prediction lower5=low_prediction
| rename high_prediction(count_prediction) as high_prediction
| eval deviation=count-round(count_prediction,0)
| streamstats window=300 current=true median(deviation) as median_of_residual
| eval abs_dev=(abs(deviation - median_of_residual))
| streamstats window=300 current=true median(abs_dev) as median_abs_dev
| eval upper_bound=if(median_of_residual + median_abs_dev * 5 < 0,abs(median_of_residual + median_abs_dev), median_of_residual + median_abs_dev * 5)
| eval anomaly=if(deviation > upper_bound,1,0)
| predict deviation as deviation_prediction period=7 algorithm=LLP5 future_timespan=0 holdback=0 upper20=high_prediction lower20=low_prediction
| fields - median_of_residual, median_abs_dev, abs_dev, high_prediction, bounds, count, count_prediction
I agree with @DalJeanis. In particular, if this is the only search like this, report acceleration is the easiest and best option for you. If you could use MLTK, you could do a one-time learning over a huge time span and true this up periodically, but that's out. Also, check out this INCREDIBLE answer by @mmodestino here:
https://answers.splunk.com/answers/511894/how-to-use-the-timewrap-command-and-set-an-alert-f.html
@mmodestino explained it so well. Thanks!!!
This is even better:
https://www.splunk.com/blog/2018/01/19/cyclical-statistical-forecasts-and-anomalies-part-1.html
https://www.splunk.com/blog/2018/02/05/cyclical-statistical-forecasts-and-anomalies-part-2.html
https://www.splunk.com/blog/2018/03/20/cyclical-statistical-forecasts-and-anomalies-part-3.html
A 3-part blog series by much smarter folks than me 😉
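For a rough idea of what that approach can look like without predict or the MLTK, here is a minimal sketch that reuses the base search from the question. It compares each 15-minute count with the average and standard deviation for the same hour of day and day of week. The field names hour_of_day, day_of_week, expected, spread and upper_bound, the 30-day window, and the multiplier of 3 are all my own assumptions, not values taken from the blog posts, so tune them to your data.

```
index=summary source="summary_events_2" orig_source=/var/log/pnr* ms_region=us-west-1 ms_level=E* earliest=-30d@d
| timechart span=15m sum(count) as count
| eval hour_of_day=strftime(_time, "%H"), day_of_week=strftime(_time, "%a")
| eventstats avg(count) as expected stdev(count) as spread by hour_of_day, day_of_week
| eval upper_bound=expected + 3 * spread
| eval anomaly=if(count > upper_bound, 1, 0)
| fields _time, count, expected, upper_bound, anomaly
```

Because the baseline is just an average per hour and weekday, it is cheap to compute and could itself be written to a summary index or lookup on a schedule, so the dashboard would only need to query the most recent data.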
This is a good use case for an accelerated report, accelerated data model or a summary index. If your report is going to be based on summarized 15m increments, then it makes more sense for the system to be calculating each 15m increment once, rather than going back two months to do so.
Start with accelerating the report, which should work for your use case.
ACCELERATED REPORT
https://docs.splunk.com/Documentation/Splunk/7.1.2/Report/Acceleratereports
ACCELERATED DATA MODEL
https://docs.splunk.com/Documentation/Splunk/latest/Knowledge/Acceleratedatamodels
SUMMARY INDEXING
https://docs.splunk.com/Documentation/Splunk/7.1.2/Knowledge/Usesummaryindexing
https://www.splunk.com/view/SP-CAAACZW
I thought of using a summary index too, but if I run a summary index search every 15m, wouldn't that affect the accuracy of predict? For example, a query with predict that runs over 2 months would give a more accurate prediction than one that runs over 4 hours, or am I misunderstanding the predict command? I am not sure how the accelerated report works, however. I have read the documentation, but I don't really see how it would solve my issue.
@kiamco - The summary index would contain the pre-summarized data. The predict could then run quickly across any length of time, and would not have to analyze the data at the event level ever again, which is what takes the majority of the CPU time.
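To make that concrete, here is a minimal sketch of the two-search pattern. Everything named here is a placeholder: your_raw_index, the summary index pnr_summary_15m (which would need to exist already), and the source label pnr_error_counts_15m are hypothetical, and I am assuming ms_region and ms_level are available on the raw events. First, a scheduled search that runs once an hour and writes four 15-minute counts into the summary index:

```
index=your_raw_index source="/var/log/pnr*" ms_region=us-west-1 ms_level=E* earliest=-1h@h latest=@h
| bin _time span=15m
| stats count by _time
| collect index=pnr_summary_15m source="pnr_error_counts_15m"
```

Then the dashboard search reads only those small pre-counted rows, so it can still look back two months and give predict the same amount of history:

```
index=pnr_summary_15m source="pnr_error_counts_15m" earliest=-60d@d
| timechart span=15m sum(count) as count
| predict count as count_prediction period=7 algorithm=LLP5 future_timespan=10 holdback=0
```

The scheduled search only ever touches one hour of raw events per run; the 60-day lookback happens against roughly 96 rows per day instead of every event. If summary_events_2 is already this coarse, the second search alone may be all you need.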