I am growing very tired of being asked to justify my "undocumented" and "bigoted" best-practice of NEVER using realtime searches in Splunk. I am sure many of you have faced the same challenge. I have created this question so that we can build a canonical list, at a single URL that we can all share, where the best and brightest of us can share our past pain with the kind intention of helping others avoid the realtime path of perfectly-avoidable regret. If you think that you will use this Q&A as a reference point, then please do me-too the question. If you have just cause to avoid realtime then PLEASE post your answer. Remember, friends don't let friends run realtime searches: let's give them the facts that they need to successfully push back. Please include links to documented disasters when possible. Keep in mind that I probably will never accept any answer to this question (to encourage others to participate in perpetuity). Let's do one objection per answer and vote on the best objections so that the most-important ones will filter to the top.
I don't know if it was truly necessary but
realtime and won't work without it. Like any product, marketing is important and realtime is one of those key buzzwords. If it is enabled (I generally disable it system-wide), I will occasionally use it ad-hoc when recovering from an outage of some kind or when onboarding an important new datasource, watching for the change to take effect so that I can announce immediately when "it" happens. IMHO, there is never a good use case for it other than ad-hoc.
I always ask the customer who creates dashboards with realtime searches whether they ever look away from the screen, get something to drink, go to the bathroom, go out for lunch, etc. If the answer is YES, they don't need realtime 🙂
For customers that create realtime alerts with email output, just ask how often they check their mail. If they say "always", start emailing them on the weekend and complain when they don't respond within a few seconds.
When somebody insists that realtime alerting is required, I ask whether they are using fully-automated response systems or humans. The answer is generally "humans". Then I ask how quickly their best guy can receive, process and understand, then react to, a typical alert notification. The answer is usually something like "5 minutes". I then follow up with: OK, then let's set up a search that runs every 5 minutes. This approach is admittedly somewhat specious and slightly a canard but it is generally effective.
So how does one disable realtime? I am glad that you asked!
On Search Heads in authorize.conf:
    [default]
    # https://answers.splunk.com/answers/734767/why-does-everybody-hate-realtime-searches-what-is.html
    # Kill all ability to do realtime (rt) searches because each one
    # permanently locks 1 CPU core on Search Head and EACH Indexer!
    # Also set this for EVERY existing role.
    rtsearch = disabled
    schedule_rtsearch = disabled
On Search Heads in limits.conf:
    [realtime]
    # https://answers.splunk.com/answers/734767/why-does-everybody-hate-realtime-searches-what-is.html
    # https://docs.splunk.com/Documentation/Splunk/latest/Search/Aboutrealtimesearches#Indexed_real-time_s...
    indexed_realtime_use_by_default = true
On Search Heads in ui-prefs.conf:
    # https://answers.splunk.com/answers/734767/why-does-everybody-hate-realtime-searches-what-is.html
    # Disable search app's homepage's real-time searches
    display.prefs.enableMetaData=0
    display.prefs.showDataSummary=0
The worst of all situations with realtime is a scheduled realtime search. If you are going to allow ad-hoc realtime searches, then for the love of your own job as Splunk admin, disable scheduled realtime searches. With default settings, a single realtime search run by a single user permanently locks 1 CPU core on the Search Head and also 1 CPU core ON EACH AND EVERY INDEXER for the duration of the search. For scheduled realtime searches, that means FOREVER AND EVER AND EVER. This cripples the system very quickly. In particular, beware Enterprise Security (ES), which ships with realtime correlation searches. ES should have been configured to use a "fake" realtime setting that does not lock cores, but if you installed ES yourself, you probably did not do this.
This is the complaint that seems to be the most damning. Taking this to the letter, it seems that realtime searches are capped at the number of cores on the search heads or the indexers, whichever is lower, and reaching anything close to that cap will severely impact the environment.
Has Splunk ever acknowledged this, and have there been any attempts to develop around it?
See my other answer on how to disable it. In there you will see indexed_realtime_use_by_default = true which causes Splunk to do "mostly realtime" and not lock cores. This is their attempt to remedy the worst parts but it ships with default value of
Imagine that your event pipeline has a latency of 3 minutes (don't laugh; this is a typical average pipeline delay) and you are running a realtime search with a 1-minute window.
You will never see your events because by the time they become available to the indexer for reporting, they are already outside of your search window. So your search isn't even doing what you think that it is doing, and it is wasting resources to boot.
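The window argument above can be sketched in a few lines of Python. This is only a toy model, not how Splunk is implemented: it assumes the realtime window is evaluated against the event's own timestamp at the moment the event becomes searchable. The function name and the numbers are hypothetical and chosen to match the 3-minute latency and 1-minute window in the example.

```python
def visible_in_realtime_window(event_time, index_time, window_seconds):
    """Return True if the event's own timestamp still falls inside the
    sliding window at the moment the event becomes searchable.

    Toy model (an assumption, not Splunk internals): the window covers
    [index_time - window_seconds, index_time] and is matched on event time.
    """
    window_start = index_time - window_seconds
    return window_start <= event_time <= index_time


# Hypothetical numbers from the example: 180 s pipeline latency.
latency_seconds = 180
event_time = 1_000_000                      # when the event happened (epoch seconds)
index_time = event_time + latency_seconds   # when it became searchable

# 1-minute window: the event is already too old when it arrives.
print(visible_in_realtime_window(event_time, index_time, 60))    # False

# A window wider than the latency would catch it.
print(visible_in_realtime_window(event_time, index_time, 300))   # True
```

Under this model, any window shorter than the pipeline latency misses the event, which is why it is worth measuring the latency before choosing a window.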
To see your latency, try this search:
    | tstats count min(_indextime) AS min, avg(_indextime) AS avg, max(_indextime) AS max WHERE index=* BY sourcetype _time span=1s
    | foreach min avg max [ eval <<FIELD>> = <<FIELD>> - _time]
    | bin span=30m _time AS tmp
    | eventstats sum(count) AS sum BY sourcetype tmp
    | eval avg = count * avg / sum
    | timechart span=30m min(min) AS min sum(avg) AS avg max(max) AS max BY sourcetype
With default settings, a single realtime search run by a single user permanently locks 1 CPU core on the Search Head and also 1 CPU core ON EACH AND EVERY INDEXER for the duration of the search. While a realtime search is running, the browser session will not time out the way it does for other long-running searches, which means that until the user kills the search or the browser session, those cores stay locked forever. That is why realtime search is BY FAR the worst thing that you can allow users to do inside of your Splunk environment. Just imagine the resource drain that a dashboard full of realtime searches would inflict.
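To put rough numbers on that drain, here is a back-of-the-envelope sketch in Python. It takes the claim above at face value (one core held on the search head and one on every indexer per running realtime search) and ignores Splunk's actual scheduler and concurrency limits, so treat it as an illustration only. The function name and the core counts are hypothetical.

```python
def cores_left(search_head_cores, indexer_cores, realtime_searches):
    """Return (search head cores left, cores left on EACH indexer).

    Simplified model, an assumption based on the claim above: every
    running realtime search holds one core on the search head and one
    core on every indexer for as long as it runs.
    """
    sh_left = max(search_head_cores - realtime_searches, 0)
    idx_left = max(indexer_cores - realtime_searches, 0)
    return sh_left, idx_left


# Hypothetical environment: 16-core search head, 12-core indexers,
# and one user who leaves a dashboard with 10 realtime panels open.
print(cores_left(16, 12, 10))   # (6, 2)

# A second user opening the same dashboard exhausts both tiers.
print(cores_left(16, 12, 20))   # (0, 0)
```

Note that adding indexers does not help in this model, because every realtime search takes a core on each of them.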
Despite what your salesman told you, SPLUNK IS NOT A REALTIME PRODUCT! It does realtime very poorly because fundamentally, Splunk intensely cares about when an event actually happened. Because there is always delay in event generation and event delivery into any system, the only way to "do realtime" with any effectiveness is to pretend that events "happened" when the system receives them. Unless you are using DATETIME_CONFIG = CURRENT which forces Splunk to lie to itself (please, Please, PLEASE do not do this), Splunk does not work that way. This means that your use of realtime in Splunk is almost certainly NOT doing/capturing what you think that it is (see other answers).
Once users figure out that they can create RT searches and alerts, they go crazy on them and suddenly your deployment is running dozens of RT alerts firing duplicate messages.
Your Splunk performance starts to suck, and your users get inundated with duplicate (email) alerts, which they filter into folders and ignore. The result is a drain on multiple system resources for very little benefit.