Hello all
As a Splunker at an early stage 😀 I currently have the following challenge:
We have many indexes and want to analyse, across all of them, how quickly log data becomes available in Splunk. The latency should be measured as the time from when the log was written (_time) to when it was indexed (_indextime). We also want to exclude outliers (e.g. we currently have hosts with a wrong time configuration), ideally with something like a Gaussian normal distribution rather than a manual cutoff.
Here is an example query, which is probably wrong or could certainly be improved:
| tstats latest(_time) AS logTime latest(_indextime) AS IndexTime WHERE index=bv* BY _time span=1h
| eval delta=IndexTime - logTime
| search (delta<1800 AND delta>0)
| table _time delta
Is the query approximately correct, so that we can answer the question of what kind of delay we have overall? And how could one use a Gaussian normal distribution instead of restricting the search manually?
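For illustration, what I have in mind instead of the fixed delta<1800 filter is something like the following sketch over raw events, where only deltas within three standard deviations of the mean are kept (the three-sigma factor is just an assumption to tune):

index=bv*
| eval delta=_indextime - _time
| eventstats avg(delta) AS avgDelta stdev(delta) AS stdevDelta
| where abs(delta - avgDelta) <= 3 * stdevDelta
| bin _time span=1h
| stats avg(delta) AS avgDelta perc95(delta) AS p95Delta BY _time

Here eventstats computes the mean and standard deviation of the delta over the whole result set, the where keeps only events close to that mean, and the final stats gives the average and 95th-percentile delay per hour. But I am not sure whether this is the right approach.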
On top of @ITWhisperer's suggestion, I'd rather not use tstats to produce just one value per hour bin, but instead calculate the average of that delta over hourly or shorter periods. If you have lots of data, you could use sampling to do this on only a small subset of events.
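A rough sketch of what I mean (the 1-in-100 random() filter is just a crude way to thin out events in SPL; the 15-minute span is an example):

index=bv*
| where (random() % 100) = 0
| eval delta=_indextime - _time
| bin _time span=15m
| stats avg(delta) AS avgDelta max(delta) AS maxDelta count BY index, _time

Note that the random() filter only discards events after they have been retrieved, so it mainly trims the statistics; as far as I know the Event Sampling setting on the search job is the better choice because it reduces the number of events Splunk actually processes.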
Well, we do have a lot of data (currently approx. 10 billion events per day and increasing). tstats is probably not the best choice here, but it is faster than a normal search. I will try sampling and see how I can use it.
Another idea is to set up saved searches for each index, store the results (_time, _indextime, index) in a summary index, and then use that to compute some statistics. But with more than 100 indexes this will take some time, effort and Splunk resources, and I am also not sure whether it will make things easier for me.
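As a sketch, one such scheduled search could look like the following, storing hourly aggregates rather than raw (_time, _indextime, index) tuples to keep the summary index small (latency_summary is just a placeholder name for a summary index that would have to be created first):

index=* earliest=-1h@h latest=@h
| eval delta=_indextime - _time
| bin _time span=1h
| stats avg(delta) AS avgDelta perc95(delta) AS p95Delta count BY index, _time
| collect index=latency_summary

Afterwards the statistics could be read cheaply from the summary, e.g. index=latency_summary | timechart span=1h avg(avgDelta) BY index. But again, I am not sure this is worth the effort.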
You could consider using the Machine Learning Toolkit (MLTK), which is a free add-on from Splunkbase.
You can set up models of your data, e.g. Gaussian / normal distributions, and then look for anomalies.
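As a rough, untested sketch (field and parameter names per my reading of the MLTK documentation), the DensityFunction algorithm can fit a normal distribution to the delta and flag the unlikely values as outliers:

index=bv*
| eval delta=_indextime - _time
| fit DensityFunction delta dist=norm
| where 'IsOutlier(delta)'=0
| bin _time span=1h
| stats avg(delta) AS avgDelta BY _time

fit adds an IsOutlier(delta) field based on how unlikely each delta is under the fitted distribution, so the where clause keeps only the "normal" events before the hourly statistics are calculated. The model can also be saved with "into" and reused with apply, and if I remember correctly MLTK only trains on a limited sample of events by default, which actually fits the sampling idea above.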