What exactly do false positives, false negatives, true positives, and true negatives mean? How do we identify them in Splunk, can we trigger them, and how are they useful to us in monitoring Splunk? Please explain.
Thank you @tscroggins @ITWhisperer .
How do we avoid ingestion latency in Splunk? Can we completely avoid these false positives and false negatives?
As @tscroggins says, it is not possible to "completely avoid" the false positives and false negatives. At the end of the day, as with a lot of things, it comes down to money. How much does it cost you / your organisation to respond to a positive alert only to find it was a false positive and therefore "wasted" cost? How much does it cost you / your organisation / your customers if you miss an "incident" due to a false negative? Lost orders? Damaged reputation? SLA breaches? These considerations can be taken into account when putting together a business case for improving your monitoring, taking on extra staff to respond to alerts, improving your infrastructure to reduce latency, rewriting your applications to be more robust and/or self-healing, etc. etc. Start looking too deeply and you won't sleep at night! Find a good enough / tolerable level of monitoring that gets you close but doesn't cost the earth!
Not directly, no. Even if the source, e.g. a web server, and the destination, e.g. a Splunk indexer, have perfectly synchronized clocks (they do not), the time (latency) it takes to share information between the source and the destination is greater than zero. That time is composed of reading the source clock, rendering the source event, writing the event to storage, reading the event from storage, serializing the event to the network, transmitting the event across the network, deserializing the event from the network, reading the destination clock, rendering the destination event, and writing the destination event to storage. The preceding list is not exhaustive and may vary. Just note that it takes time to go from A to B. There are delays everywhere!
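If you want to see how large those delays actually are in your environment, you can compare _indextime with _time. A minimal sketch, assuming the same index=web used elsewhere in this thread (substitute your own index) and that _time is parsed from the event's own timestamp:
index=web earliest=-60m@m latest=@m
| eval ingest_delay=_indextime-_time
| stats avg(ingest_delay) AS avg_delay_secs perc95(ingest_delay) AS p95_delay_secs max(ingest_delay) AS max_delay_secs
The p95 and max values give you a feel for how far behind indexing can fall, which helps when choosing search time ranges and backoffs later.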
You can search by _indextime instead of _time using _index_earliest and _index_latest and very wide earliest and latest parameters:
index=web status=400 earliest=0 latest=+1d _index_earliest=-15m@m _index_latest=@m
However, it's still possible to miss events that have been given an _indextime value of T but aren't synchronized to disk until after your search executes.
You can use real-time searches to see events as they arrive at indexers (or as they're written to storage, depending on the type of real-time search), but for your use case, time windows are still required, and events may still be missed.
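As a rough sketch of the windowed real-time variant (an assumption of how you might express it, not a recommendation; windowed real-time searches keep a search process running continuously on the indexers), the same condition over a sliding 15-minute window could look like:
index=web status=400 earliest=rt-15m latest=rt
| stats count
| where count>5
Even here, events that arrive after the window has moved on can still be missed.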
A false positive is something that is reported as being true when it is actually false.
A false negative is something that is reported as being false when it is actually true.
In monitoring terms, this could be related to, for example, an alarm being raised when the condition / threshold has not been reached (false positive) or an alarm not being raised when the condition / threshold has been reached (false negative).
Both these situations should be avoided whenever possible, although in some environments this is not always achievable. If this perfect monitoring scenario cannot be reached, you have to decide at what point the number of false alarms is tolerable for your organisation.
Thank you @ITWhisperer
So on a daily basis in a Splunk environment, which of the above 4 cases will be the most frequent scenario? How do we avoid it? So you are saying false alerts will be triggered even though the condition has not been met... how is that possible?
What is the mechanism for false positives?
Example: if the count of status=400 events exceeds 5 in the last 15 minutes, an alert should be triggered. We will configure the alert correctly, so why would the alert still be triggered? Is it a malfunction of Splunk? I didn't get you. Do false positives generally happen?
Can you please give more detail on this?
Thanks once again.
Hi @Karthikeya,
False positive, false negative, etc. have the same definitions in Splunk that they have in statistics.
I'm in the United States, and I find NIST/SEMATECH e-Handbook of Statistical Methods, Chapter 6, "Process or Product Monitoring and Control," a useful day-to-day reference: https://www.itl.nist.gov/div898/handbook/index.htm.
In your example, you're counting events. For example, a basic search scheduled to run every minute:
index=web status=400 earliest=-15m@m latest=@m
| stats count
| where count>5
gives you the count of status=400 events over the prior 15 minutes.
In this context, false positive and false negative could relate to the time the events were generated and the delay between that time and the index time.
If a status=400 event occurred at 00:14:59 but was indexed by Splunk at 00:15:04, then a search that executes at 00:15:01 for the interval [00:00:00, 00:15:00) would not count the event because it has not yet been indexed by Splunk. This is a false negative.
You can reduce the probability of false negatives by adding a backoff to your search--1 minute in this example:
index=web status=400 earliest=-16m@m latest=-1m@m
| stats count
| where count>5
However, that will not eliminate all false negatives because there is still a non-zero probability that an event will be indexed outside your search time range.
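Another option, building on the _indextime search shown earlier, is to count by index time rather than event time, so the alert sees events whenever they arrive, at the cost of occasionally counting a late event in a later window than the one it "belongs" to. A sketch, assuming the same alert condition:
index=web status=400 earliest=0 latest=+1d _index_earliest=-15m@m _index_latest=@m
| stats count
| where count>5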
False positives are more typically associated with measuring against a model. Let's say you've modeled your application's behavior and determined that more than 5 status=400 events over a 15 minute interval likely indicates a client-side code deployment issue as opposed to "normal" client behavior. "More than 5" is associated with a control limit, for example a deviation from a mean; however, the number of status=400 events is a random variable. A bad client-side code deployment may trigger 4 status=400 events, which is a false negative, and a good client-side deployment may trigger 6 status=400 events, which is a false positive.
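As an illustration of the modeling idea, here is a minimal sketch using simple summary statistics rather than a validated model; the 3-standard-deviation control limit and the 7-day baseline are assumptions for the example only:
index=web status=400 earliest=-7d@m latest=@m
| timechart span=15m count
| eventstats avg(count) AS mean stdev(count) AS sd
| eval upper_limit=mean+3*sd
| where count>upper_limit
For alerting you would typically evaluate only the most recent complete 15-minute bucket against the limit; buckets that sit near the limit are exactly where false positives and false negatives trade off against each other.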
Several Splunk value-added products like Splunk Enterprise Security and Splunk IT Service Intelligence provide ready-to-run modeling and monitoring solutions, but in general, you would model your application's behavior using either traditional methods outside Splunk or statistical functions or an add-on like the Splunk Machine Learning Toolkit inside Splunk. You would then apply your model using custom Splunk searches.
It is not a malfunction of Splunk - false positives and negatives can arise if your monitoring solution is not robust enough for your requirements. For example, in your scenario, if you are monitoring every 15 minutes, let's say at 00, 15, 30 and 45 minutes past the hour, but you get 400 errors at 12, 13, 14, 15, 16 and 17 minutes past, you have 6 errors, but 3 fall into the 00-14 time bucket and 3 fall into the 15-29 time bucket, so no alert fires. Would you say this is a missed alert (false negative) or something you would tolerate? In another scenario, let's say you have errors occurring at 13, 14, 15, 16, 28 and 29 minutes past, but the 13 and 14 errors arrive late, so they are picked up in the 15-29 time bucket and you raise an alert. This might be seen as a false positive, i.e. an alert that you didn't really want. It all comes down to what your requirements are and what tolerances you are prepared to accept in your monitoring environment. See the sketch below for a way to visualise how errors split across these buckets.
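To see the bucket-boundary effect described above, you can bin recent errors into the same 15-minute buckets the scheduled search would use (a sketch, again assuming index=web):
index=web status=400 earliest=-4h@m latest=@m
| bin _time span=15m
| stats count BY _time
Scheduling the alert more frequently over a trailing window (for example every minute over the last 15 minutes, as in @tscroggins' example) reduces the impact of events straddling a bucket boundary, though it does not eliminate late-arrival issues.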