
How to build a machine learning model for high volume ingestion on indexes?

MG
Engager

I want to build a machine learning model to detect anomalies on a high-volume ingestion index. The problem I'm facing is with the small indexes when I fit the model with DensityFunction. How can I overcome this?

 


ITWhisperer
SplunkTrust

Part of the process for building a model is cleaning the data. You could consider the smaller indexes to be data you don't want in the model so you should remove them before fitting.


MG
Engager

But I need those small indexes as well.


ITWhisperer
SplunkTrust

Build two models
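As a rough sketch of what two models could look like, assuming a base search that already produces one Count_Events value per index per time bucket: the index names and model names below are hypothetical placeholders, not values from this thread. Each fragment would be appended to that base search and run separately.

High-volume indexes:

```spl
| search index IN ("hypothetical_big_index_1", "hypothetical_big_index_2")
| fit DensityFunction Count_Events into hypothetical_high_volume_model
```

Low-volume indexes:

```spl
| search index IN ("hypothetical_small_index_1", "hypothetical_small_index_2")
| fit DensityFunction Count_Events into hypothetical_low_volume_model
```

DensityFunction also accepts an optional by clause (for example by "index"), which fits a separate distribution per group. That may be an alternative to maintaining two models by hand, depending on how many indexes you have.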

Or, manipulate the ranges provided by the model to use the lower range for the smaller indexes and the higher range for the larger indexes. This assumes that the values are diverse enough that data fits into multiple ranges, and that you can determine which range to use for the anomaly detection - you may need to do this part "manually" rather than relying on the apply command. For example, in anomaly detection I have done in the past, I was only interested in anomalies which breached the highest value (I was looking at error rates so wasn't too concerned if there were fewer errors than normal!). I did this by evaluating the actual value against the highest value in the range and setting the anomaly flag accordingly.
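A minimal sketch of the "manual" check described above, assuming each row already carries a numeric Count_Events value and a numeric upperBound field holding the highest value of the range (these are the field names used in the table posted further down this thread, so adjust them to your own search):

```spl
| eval Outlier_High_index=if(Count_Events > upperBound, 1, 0)
| where Outlier_High_index=1
```

Only rows that breach the upper bound survive the where clause, so values below the normal range are ignored. A matching check against lowerBound could be added in the same way if low-side anomalies matter for your data.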


MG
Engager

OK
Finally, I have a table as below:
|table _time index Count_Events hod dow lowerBound upperBound differencefromlowerbound differencefromupperbound Outlier_Low_index Outlier_High_index

From this table, I want to create an alert only when the same index name appears more than twice in the search results. With the stats command below, I get an Indexcount of 1 even though the index name appears twice, because the events are not the same. How can I do this?

|stats count as Indexcount by _time Events index hod dow lowerBound upperBound differencefromlowerbound differencefromupperbound Outlier_Low_index Outlier_High_index



ITWhisperer
SplunkTrust

Try removing the _time from the by clause?


MG
Engager

It did not help, as the values are different for all the fields. I still get the same index names twice.


ITWhisperer
SplunkTrust

If the index name appears multiple times, it is because there is a difference in the other fields. For example, with hod (hour of day?), it is perhaps not surprising that the stats are different for different times of the day or days of the week (dow?).

Depending on your data, this makes sense: you probably want your model to take time into account, so that you can pick up on anomalies relative to normal behaviour for the same time periods. (This is certainly how I have used it.)

If you don't want the index to appear more than once, you need to remove the varying fields, but that could limit the relevance of the model.
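For the alert itself, one possible sketch is to count rows by index only, so that the varying fields no longer split the groups. This assumes the table shown earlier already contains only the rows you consider outliers. The values() calls are optional and simply keep the hours and days involved for context.

```spl
| stats count AS Indexcount values(hod) AS hod values(dow) AS dow BY index
| where Indexcount > 2
```

With this, each index appears once, and the alert can trigger when the search returns any results. The model is still fitted with hod and dow; only the final alert aggregation drops them from the by clause.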
