How to build a machine learning model for high volume ingestion on indexes?

MG · 05-23-2023

I want to build a machine learning model to detect anomalies on the high volume ingestion indexes. The problem I'm facing is with the small indexes when I fit the model with DensityFunction. How can I overcome this?

ITWhisperer · 05-24-2023

Part of the process for building a model is cleaning the data. You could consider the smaller indexes to be data you don't want in the model so you should remove them before fitting.
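
As a rough sketch of that cleaning step in SPL, the smaller indexes can be filtered out before the fit command runs. The index names small_index_a and small_index_b, the hourly span, the field names Count_Events, hod and dow, and the model name are all assumptions for illustration, not taken from the original search:

```spl
| tstats count as Count_Events where index=* by index _time span=1h
| search NOT index IN (small_index_a, small_index_b)
| eval hod=strftime(_time, "%H"), dow=strftime(_time, "%a")
| fit DensityFunction Count_Events by "hod,dow" into high_volume_ingestion_model
```

Splitting by hod and dow is optional; it lets DensityFunction fit a separate distribution for each hour and weekday combination.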

MG · 05-24-2023

But I need those small indexes as well.

ITWhisperer · 05-24-2023

Build two models

Or manipulate the ranges provided by the model so that the lower range is used for the smaller indexes and the higher range for the larger indexes. This assumes that the values are diverse enough that the data fits into multiple ranges, and that you can determine which range to use for the anomaly detection; you may need to do this part "manually" rather than relying on the apply command. For example, in anomaly detection I have done in the past, I was only interested in anomalies that breached the highest value (I was looking at error rates, so I wasn't too concerned if there were fewer errors than normal!). I did this by comparing the actual value against the highest value in the range and setting the anomaly flag accordingly.
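
A minimal sketch of that "manual" check, assuming the top of the model's range has already been extracted into a field called upperBound and the observed value is in Count_Events. Both field names, and the flag name Outlier_High_index, are assumptions here:

```spl
| eval Outlier_High_index=if(Count_Events > upperBound, 1, 0)
| where Outlier_High_index=1
```

Only values above the top of the range are flagged. Values below the lower end are ignored, which matches the error-rate case where fewer errors than normal is not a concern.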

MG · 05-25-2023

OK.
Finally, I have a table as below:
|table _time index Count_Events hod dow lowerBound upperBound differencefromlowerbound differencefromupperbound Outlier_Low_index Outlier_High_index

From this table, I want to create an alert only when the search results contain the same index name more than twice. If I use the stats command below, I get an Indexcount of 1 even though the index name appears twice, because the events are not the same. How can I do this?

|stats count as Indexcount by _time Events index hod dow lowerBound upperBound differencefromlowerbound differencefromupperbound Outlier_Low_index Outlier_High_index

ITWhisperer · 05-25-2023

Try removing the _time from the by clause?

MG · 05-25-2023

It did not help, as the values are different for all the fields. I still got the same index names twice.

ITWhisperer · 05-25-2023

If the index name appears multiple times, it is because there is a difference in the other fields. For example, with hod (hour of day?) it is perhaps not surprising that the stats are different for different times of the day or days of the week (dow?).

Depending on your data, this makes sense. You possibly want your model to take time into account so that you can pick up on anomalies relative to normal behaviour in the same time periods. (This is certainly how I have used it.)

If you don't want the index to appear more than once, you need to remove the varying fields, but that could limit the relevance of the model.
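
If the goal is only to alert when an index shows up more than twice in the results, one option is to count rows per index without grouping on the varying fields. This is a sketch that assumes the table already contains only the rows that should be counted; Indexcount is the name used in the question:

```spl
| eventstats count as Indexcount by index
| where Indexcount > 2
```

eventstats keeps every original row and its other fields, so the alert can still show hod, dow and the bounds. If only the index names are needed, stats count as Indexcount by index followed by the same where clause gives one row per index.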
