I am growing very tired of being asked to justify my "undocumented" and "bigoted" best practice of NEVER using realtime searches in Splunk. I am sure many of you have faced the same challenge. I have created this question so that we can build a canonical list behind a single URL, where the best and brightest of us can share our past pain with the kind intention of helping others avoid the realtime path of perfectly avoidable regret. If you think that you will use this Q&A as a reference point, then please do me-too the question. If you have just cause to avoid realtime, then PLEASE post your answer. Remember, friends don't let friends run realtime searches: let's give them the facts that they need to successfully push back. Please include links to documented disasters when possible. Keep in mind that I will probably never accept any answer to this question (to encourage others to participate in perpetuity). Let's do one objection per answer and vote on the best objections so that the most important ones will filter to the top.
Once users figure out that they can create RT searches and alerts, they go crazy with them, and suddenly your deployment is running dozens of RT alerts firing duplicate messages.
Your Splunk performance starts to suck, and your users get inundated with duplicate (email) alerts, which they filter into folders and then ignore. The result is a drain on multiple system resources for very little benefit.
Despite what your salesman told you, SPLUNK IS NOT A REALTIME PRODUCT! It does realtime very poorly because, fundamentally, Splunk intensely cares about when an event actually happened. Because there is always delay in event generation and event delivery into any system, the only way to "do realtime" with any effectiveness is to pretend that events "happened" when the system receives them. Unless you are using DATETIME_CONFIG = CURRENT, which forces Splunk to lie to itself (please, Please, PLEASE do not do this), Splunk does not work that way. This means that your use of realtime in Splunk is almost certainly NOT doing/capturing what you think that it is (see other answers).
With default settings, a single realtime search run by a single user permanently locks 1 CPU core on the Search Head and also 1 CPU core ON EACH AND EVERY INDEXER for the duration of the search. While a realtime search is running, the browser session will not time out the way it does for other long-running searches, which means that until the user kills the search or the browser session, those cores stay locked forever. That is why realtime search is BY FAR the worst thing that you can allow users to do inside of your Splunk environment. Just imagine the resource drain that a dashboard full of realtime searches would inflict.
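To put rough numbers on that, here is a minimal back-of-the-envelope sketch in Python. It only encodes the one-core-per-search behavior described above; the function name and the core counts are hypothetical and are not anything that Splunk ships.

```python
def cores_left(search_head_cores, indexer_cores, realtime_searches):
    """Return (search head cores left, cores left on each indexer).

    Assumption taken from the answer above: every running realtime search
    pins one core on the search head and one core on every indexer.
    """
    sh_left = max(search_head_cores - realtime_searches, 0)
    idx_left = max(indexer_cores - realtime_searches, 0)
    return sh_left, idx_left


# Hypothetical deployment: a 16-core search head and 12-core indexers.
# One dashboard with 8 realtime panels, opened by a single user:
print(cores_left(16, 12, 8))   # (8, 4)
# The same dashboard left open by two users:
print(cores_left(16, 12, 16))  # (0, 0)
```

Under these assumptions, two people leaving one dashboard open is enough to starve both tiers.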
Imagine that your event pipeline has a latency of 3 minutes (don't laugh; this is a typical average pipeline delay) and that you are running a realtime search with a 1-minute window.
You will never see your events because by the time they become available to the indexer for reporting, they are already outside of your search window. So your search isn't even doing what you think that it is doing, and it is wasting resources to boot.
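The window arithmetic can be sketched in a few lines of Python. This is only an illustration, assuming that the realtime window is evaluated against the event's own timestamp (_time); the function name and the timestamps are hypothetical.

```python
def visible_in_window(event_time, arrival_time, window_seconds):
    """True if the event is still inside the realtime window when it arrives.

    Assumption: the window covers [arrival_time - window_seconds, arrival_time]
    and is compared against the event's own timestamp, not its arrival time.
    """
    return arrival_time - event_time <= window_seconds


latency = 180   # 3-minute pipeline delay, in seconds
window = 60     # 1-minute realtime window, in seconds

event_time = 1_000_000              # arbitrary epoch timestamp
arrival_time = event_time + latency

print(visible_in_window(event_time, arrival_time, window))   # False
print(visible_in_window(event_time, arrival_time, 5 * 60))   # True
```

In this sketch the event only shows up once the window is wider than the latency, which is why it helps to measure the latency first.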
To see your latency, try this search:
| tstats count min(_indextime) AS min, avg(_indextime) AS avg, max(_indextime) AS max WHERE index=* BY sourcetype _time span=1s | foreach min avg max [ eval <<FIELD>> = <<FIELD>> - _time] | bin span=30m _time AS tmp | eventstats sum(count) AS sum BY sourcetype tmp | eval avg = count * avg / sum | timechart span=30m min(min) AS min sum(avg) AS avg max(max) AS max BY sourcetype
The worst of all situations with realtime is a scheduled realtime search. If you are going to allow ad-hoc realtime searches, then for the love of your own job as Splunk admin, disable scheduled realtime searches. With default settings, a single realtime search run by a single user permanently locks 1 CPU core on the Search Head and also 1 CPU core ON EACH AND EVERY INDEXER for the duration of the search. For scheduled realtime searches, that means FOREVER AND EVER AND EVER. This cripples the system very quickly. In particular, beware Enterprise Security (ES), which ships with realtime correlation searches. ES should have been configured to use a "fake" realtime setting that does not lock cores, but if you installed ES yourself, you probably did not do this.
This is the complaint that seems to be the most damning. Taking this to the letter, it seems that realtime searches are capped at the number of cores on the search heads or the indexers, whichever is lower, and reaching anything close to that cap will severely impact the environment.
Has there ever been any recognition of this by Splunk, or any attempt to develop around it?
See my other answer on how to disable it. In there you will see indexed_realtime_use_by_default = true, which causes Splunk to do "mostly realtime" and not lock cores. This is their attempt to remedy the worst parts but it ships with default value of