Splunk Search

How is the unexpectedness score of the anomalies command calculated?

stephanbuys
Path Finder

How does the unexpectedness score actually get computed? How does the anomalies command play out if I have n events? http://www.splunk.com/base/Documentation/latest/SearchReference/Anomalies

1 Solution

carasso
Splunk Employee

The algorithm is proprietary, but roughly speaking, the unexpectedness of an event X coming after a set of previous events P is estimated as:

 u(X | P) =  ( s(P and X) - s(P) ) /  ( s(P) + s(X) )

where s() is a metric of how similar or uniform the data is. This formula tends to be less noisy on real data than other formulas we tried, since we just want a measure of how much adding X affects similarity, while normalizing for differing event sizes.

The size of the sliding window of previous events P is determined by the 'maxvalues' argument, which defaults to 100 events. By default, the raw text (_raw) of the events is used, but any other field can be used via the 'field' argument. By default, the command keeps only events whose unexpectedness exceeds the 'threshold' argument, which defaults to 0.01, and removes the rest; if the 'labelonly' argument is set to true, it only annotates the events with an unexpectedness score, rather than removing the "boring" events.
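Spelled out with those defaults, a search might look like this (the base search is a placeholder; `unexpectedness` is the score field the command adds):

```
index=main sourcetype=access_combined
| anomalies field=_raw maxvalues=100 threshold=0.01 labelonly=true
| sort - unexpectedness
```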

You can run anomalies after anomalies to further narrow down the results. Since each run operates over a window of 100 events, the second call approximates running over a window of 10,000 previous events.
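Such a chained run would look like this (hypothetical base search):

```
index=main | anomalies | anomalies
```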

Finally, nothing beats domain knowledge. If you know what you are looking for, it might make sense to write your own search command to find your anomalies.

