I am growing very tired of being asked to justify my "undocumented" and "bigoted" best practice of NEVER deploying Splunk infrastructure (Search Heads, Indexers, License Manager, Cluster Master, Deployer, Deployment Server, Monitoring Console, etc.) on any Windows OS. I am sure many of you have faced the same challenge. I have created this question so that we can build a canonical list and all share the same URL, where the best and brightest of us can share our past pain with the kind intention of helping others avoid the Windows path of perfectly-avoidable regret. If you think you will use this Q&A as a reference point, then please "me too" the question. If you have just cause to avoid Windows, then P*L*E*A*S*E post your answer. Remember, friends don't let friends deploy on Windows: let's give them the facts they need to push back successfully. Please include links to documented disasters when possible. Keep in mind that I will probably never accept any answer to this question (to encourage others to participate in perpetuity). Let's do one objection per answer and vote on the best objections so that the most important ones filter to the top.
Running multiple UF instances on one box is a fragile, via-Support-only affair under Windows, as opposed to *NIX, where you just unpack the tgz multiple times and set a few configs. (A sketch of the *NIX approach follows.)
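For contrast, here is roughly what that *NIX approach looks like. This is a minimal sketch; the tarball name, install paths, and the second management port are illustrative assumptions, not prescriptions:

    # Unpack two independent UF instances from the same tarball
    mkdir -p /opt/uf1 /opt/uf2
    tar -xzf splunkforwarder.tgz -C /opt/uf1 --strip-components=1
    tar -xzf splunkforwarder.tgz -C /opt/uf2 --strip-components=1

    # Give the second instance its own management port so the two
    # splunkd processes don't collide (8090 is an arbitrary choice)
    printf '[settings]\nmgmtHostPort = 127.0.0.1:8090\n' \
        > /opt/uf2/etc/system/local/web.conf

    # Start both instances
    /opt/uf1/bin/splunk start --accept-license --no-prompt
    /opt/uf2/bin/splunk start --accept-license --no-prompt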
True, and this will be necessary if you are forwarding compressed files, because the archive queue handler (AEQ, a.k.a. AQ) is single-threaded and becomes a HUGE bottleneck with even small numbers of *.zip files. I once had ~30 UF instances installed on a single host just to handle incoming *.zip files.
Sometimes the problem isn't a bug; rather, things are unexpectedly "just different" on Windows. The following Splunk documentation link is a good starting point for learning the differences between Unix and Windows operations:
This was going to be my main point. Note that some regexes work differently on Windows, and this is undocumented; getting whitelisting and blacklisting regexes right can be an arduous task of trial and error. See the sketch below for the shape of the problem.
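A classic instance of that trial and error: whitelisting files under a monitored Windows path. The whitelist is a regex matched against the full file path, so the path's own backslashes have to be escaped in the pattern. This sketch is hypothetical (the stanza path and filenames are invented):

    # inputs.conf on a Windows UF
    [monitor://C:\inetpub\logs\LogFiles]
    # whitelist is a regex against the full path, so literal
    # backslashes in the path must be doubled in the pattern
    whitelist = \\W3SVC\d+\\.*\.log$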
If you ever have blocked queues, you may find that your Indexers suddenly refuse to receive data from forwarders, requiring the whole Indexer tier to be rebooted (this does not happen on Linux Indexers):
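As an aside, one generic way to spot blocked queues before an indexer wedges is to search the internal metrics log from the Monitoring Console or any search head. A minimal sketch (the blocked=true field in metrics.log is standard; the stats breakdown is just one way to slice it):

    index=_internal source=*metrics.log* group=queue blocked=true
    | stats count by host, name
    | sort - count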
There is some kind of intractable race condition between the Windows Splunk service and many logging services, such that a standard installation of Splunk can come up in a state where events cannot be forwarded without corruption. The workaround is to delay the start of the Splunk service, but even this does not always prevent the problem (although it usually does). Keep in mind that you need to monitor the OS on your Splunk infrastructure too, so problems forwarding security logs there are big problems. See here:
https://answers.splunk.com/answers/200924/formatmessage-error-appears-in-indexed-message-for.html
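For anyone applying the delayed-start workaround, it can be scripted with sc.exe. A minimal sketch, assuming the Universal Forwarder's default service name (a full Splunk Enterprise install registers Splunkd instead):

    :: Set the Splunk UF service to delayed automatic start.
    :: Note: sc.exe requires the space after "start=".
    sc.exe config SplunkForwarder start= delayed-auto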
This makes Windows a risky option for Heavy Forwarder or Syslog+UF.
This one comes up a lot during patching cycles!
That's when I pull out my "I told you so" card.
Windows permissions and file ownership, particularly on indexers. I have had too many "BucketMover inflight" errors because either LocalSystem or an MSA could not create, delete, or rename folders. There are workarounds, and you can routinely run icacls.exe against the data volumes, but who has ever had to cron chmod or chown commands on their *NIX indexers? No one.
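To make the contrast concrete, here is the kind of recurring workaround this implies on Windows. The volume path and service account name are hypothetical placeholders:

    :: Scheduled icacls pass so the Splunk service account can
    :: create/rename/delete bucket folders under the index volume
    icacls.exe "D:\splunkdata" /grant "DOMAIN\svc_splunk:(OI)(CI)F" /T /C

versus the *NIX equivalent that you run once (if ever):

    chown -R splunk:splunk /opt/splunk/var/lib/splunk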
Most of the Splunk documentation (and especially the training documentation) is *NIX-focused. Things are much better now, but even so, in most classes I attended (even last year) there was somebody on Windows whose cut-and-paste would not work because it was wrong for their platform. This is clearly a known problem, because the instructors proactively warn everyone about it in the class chat.
The Python interpreter on Windows is far slower than on Linux. I believe this mainly affects Enterprise Security; nothing is broken, but a lot of things take longer to run.
Possibly related: our environment is virtually unable to run Qmulos Compliance on Windows Server 2016. Clicking any "Submit"-style button takes 10-20 seconds to respond, regardless of the button's function. It is now an officially recognized bug. The problem does not occur on any other Windows OS version.
Thanks, just another push for Linux.
Most Splunk admins I know have had many cases where Splunk Indexers and Search Heads crashed due to memory leaks in the OS. I have NEVER seen this happen on *NIX (although I am sure it has on rare occasions). Many *.0 releases of Splunk on Windows have shipped with a memory leak that made it through testing; the *NIX releases have not.
I have now had this happen several times on *NIX in the 7.* releases (shame on Splunk for not doing regression/capacity testing with bounds checking).
If we get to the point where this has stabilized, it would behoove us to specify a reasonable range of releases where this was a problem (IIRC, 7.2.0 through 7.2.3?), just so we know it's not an ongoing problem.
All but one: I ran a single SH/Indexer box on Windows for years, from version 4.3 through 6.0 (I may even have skipped 5.x entirely). I had no significant problems; possibly none at all, though I can't remember in that much detail, but certainly nothing serious.
True, I do know you, @rich7177!
Yes. 🙂
Note that I'm STILL not recommending Splunk on Windows*; I'm just saying I had no problems in several years of running Splunk on Windows.
A small shop running Splunk Free, with no use for more than 10 or 20 GB/day of license because they simply don't have that much going on: a Windows all-in-one box would probably be fine.
Small shops that have no Linux experience. Again, with maybe up to a 50 GB/day limit and no replication requirements.
Places with no real IT people, just a person on site who can take care of the few day-to-day things...
Wait, I see the common thread: very small places (data-wise) with little to no Linux experience.
It is still a Windows "best practice" to reboot monthly (if not more frequently). I have seen Linux indexers with uptimes of YEARS. Who can afford monthly Indexer downtime just so the host OS doesn't crash?
A rolling reboot in a cluster shouldn't pose a big issue.
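Agreed; for the record, a peer can be taken down for an OS reboot without upsetting the cluster. A minimal sketch, assuming a default *NIX install path and a healthy cluster with enough bucket copies to stay searchable:

    # On the indexer peer about to be patched: gracefully hand off
    # this peer's duties before the OS goes down
    /opt/splunk/bin/splunk offline

    # ... patch and reboot the OS; the peer rejoins on restart and
    # the cluster master rebalances as needed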