## Splunk Machine Learning Toolkit: Density Function Algorithm - What is considered an "anomaly"?

New Member

In the output of the density function algorithm, is an anomaly any data point that departs from “normalcy”?

For example, if the historical response time (for some web server, say) is 500 milliseconds, would a response of 100 milliseconds be considered an anomaly? After all, 100 milliseconds is “better” than 500 milliseconds. Sure, it’s different from 500 milliseconds, but it’s better because it’s faster. In other words, is there a “mechanism” in Splunk that prevents that ‘pesky’ 100 milliseconds from being tagged as an ‘anomalous’ event? A response of 5,000 milliseconds, of course, is a bona fide anomaly.

Splunk Employee

The "normalcy" of a value is determined by how likely that value is to occur according to your training data (past observations), not by whether the value is "good" or "bad". For example, if 98% of your requests see a response time between 1000ms and 5000ms (according to your training data), then a response time between 0ms and 1000ms is at most 2% likely, so it might (see below) be marked as anomalous.
Now, the threshold parameter is key here. When you set threshold to, say, 0.05, you're telling the algorithm that you consider the 5% least likely data points to be anomalous. Notice that we're talking about probabilities and not actual values. So, in the above example, if you set threshold=0.05 then a latency of 500ms is anomalous (because, remember, any value between 0ms and 1000ms is at most 2% likely to occur, according to your training data and the statistical model that DensityFunction created for you).
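To make the idea concrete, here is a minimal Python sketch of likelihood-based flagging. It is not Splunk's implementation: it assumes a plain normal distribution, splits the threshold evenly between the two tails, and uses made-up training values. The names `fit_normal`, `is_anomalous`, and `training_ms` are hypothetical and exist only for illustration.

```python
from statistics import NormalDist, mean, stdev

def fit_normal(samples):
    """Fit a normal distribution to past observations (an assumed model)."""
    return NormalDist(mean(samples), stdev(samples))

def is_anomalous(model, value, threshold=0.05):
    """Flag values that fall in the least likely `threshold` share of the
    fitted distribution, split evenly between the low and high tails."""
    lower = model.inv_cdf(threshold / 2)
    upper = model.inv_cdf(1 - threshold / 2)
    return value < lower or value > upper

# Hypothetical training data: response times in milliseconds.
training_ms = [2800, 3100, 2950, 3300, 2700, 3050, 3200, 2900, 3000, 3150]
model = fit_normal(training_ms)

for latency in (500, 3000, 5000):
    print(latency, is_anomalous(model, latency))
```

In this sketch both the unusually fast 500ms response and the unusually slow 5000ms response are flagged, because the decision depends only on how unlikely a value is under the fitted model, not on whether it is faster or slower than usual.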
