I am growing very tired of being asked to justify my "undocumented" and "bigoted" best practice of NEVER deploying Splunk infrastructure (Search Heads, Indexers, License Manager, Cluster Master, Deployer, Deployment Server, Monitoring Console, etc.) on any Windows OS. I am sure many of you have faced the same challenge. I have created this question so that we can build a canonical list at a single URL we can all share, where the best and brightest of us can describe our past pain with the kind intention of helping others avoid the Windows path of perfectly avoidable regret.
If you think you will use this Q&A as a reference point, then please me-too the question. If you have just cause to avoid Windows, then P*L*E*A*S*E post your answer. Remember, friends don't let friends deploy on Windows: let's give them the facts they need to push back successfully. Please include links to documented disasters when possible. Keep in mind that I probably will never accept any answer to this question (to encourage others to participate in perpetuity). Let's do one objection per answer and vote on the best objections so that the most important ones filter to the top.
This was going to be my main point. Some regex behavior differs on Windows and is undocumented, so writing whitelist and blacklist regexes can be an arduous process of trial and error.
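To illustrate one common source of that trial and error (my assumption, not necessarily the specific undocumented behavior meant above): monitor whitelist/blacklist regexes in inputs.conf are matched against the full file path, and on Windows that path contains backslashes, which are also the regex escape character. The minimal sketch below uses Python's re module as a stand-in for Splunk's regex engine, and the paths and pattern names are hypothetical. It shows a *NIX-style pattern silently failing on a Windows path, the escaped Windows form, and a separator-agnostic variant.

```python
import re

# Hypothetical monitored files (illustrative only).
LINUX_PATH = "/var/log/app/server.log"
WINDOWS_PATH = r"C:\Program Files\App\logs\server.log"

# Hypothetical whitelist patterns; Python's re is only a stand-in for
# Splunk's own regex engine here.
PATTERNS = {
    # *NIX-centric whitelist, the way most examples are written.
    "nix_style": re.compile(r"/logs?/.*\.log$"),
    # Same intent for Windows: each literal backslash in the path must
    # itself be escaped as "\\" inside the regex.
    "win_style": re.compile(r"\\logs?\\.*\.log$"),
    # Separator-agnostic: a character class that accepts "\" or "/".
    "portable": re.compile(r"[\\/]logs?[\\/].*\.log$"),
}


def matches(name: str, path: str) -> bool:
    """Return True if the named pattern matches anywhere in the path."""
    return PATTERNS[name].search(path) is not None


if __name__ == "__main__":
    # Print a small match table: pattern name, result, path.
    for name in PATTERNS:
        for path in (LINUX_PATH, WINDOWS_PATH):
            print(f"{name:10} {matches(name, path)!s:5} {path}")
```

The character class [\\/] is one way to keep a single pattern working on both platforms if the same inputs.conf is pushed to mixed forwarders; whether that suits your deployment is a judgment call.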
If you ever have blocked queues, you may find that your Indexers suddenly refuse to receive data from forwarders, requiring the whole Indexer tier to be rebooted (this does not happen on Linux Indexers):
There is some kind of intractable race condition between the Windows Splunk service and many logging services, such that a standard installation of Splunk can come up in a state where events cannot be forwarded without being corrupted. The workaround is to delay the start of the Splunk service, but even this does not always prevent the problem (although it usually does). Keep in mind that you need to monitor the OS on your Splunk infrastructure, too, so problems forwarding the security logs there are big problems. See here:
This makes Windows a risky choice for a Heavy Forwarder or a Syslog+UF host.
Windows permissions and file ownership, particularly on indexers. I have had too many BucketMover inflight errors because either LocalSystem or an MSA could not create, delete or rename folders. There are workarounds, and you can routinely icacls.exe it, but who has ever had to cron chown commands on their *NIX indexers? No one.
Most of the Splunk documentation (and especially the training documentation) is *NIX-focused. Things are much better now, but even so, in most classes I attended (even last year) there was somebody on Windows whose cut-and-paste commands would not work because they were wrong for Windows. This is a known problem: the instructors routinely give everyone heads-up warnings about it.
The Python interpreter on Windows is far slower than on Linux. I believe this mainly affects Enterprise Security; nothing is broken, but a lot of things take longer to run.
Possibly related: our environment is virtually unable to run Qmulos - Compliance on Windows Server 2016. Clicking any "Submit"-style button takes 10-20 seconds to load, regardless of the button's function. It is now an officially recognized bug. The problem does not occur on any other Windows version.
Most Splunk admins I know have had many cases where Splunk Indexers and Search Heads crashed due to memory leaks in the OS. I have NEVER seen this happen on *NIX (although I am sure it has on rare occasions). Many *.0 releases of Splunk on Windows have contained a memory leak that made it through testing, whereas the *NIX releases did not.
If this ever stabilizes, I think it would behoove us to specify the range of releases where this was a problem (IIRC, 7.2.0 through 7.2.3?), just so we know it's not an ongoing problem.
All but one: I ran a single SH/Indexer box on Windows for years, from version 4.3 to 6.0 (I may have even skipped 5.x entirely!). I had no significant problems (possibly none at all; I can't remember in that much detail, but certainly nothing serious).
Note: I'm STILL not recommending Splunk on Windows*, just saying I had no problems in several years of running Splunk on Windows.
Small shops running Splunk Free that have no use for more than 10 or 20 GB/day of license because they simply don't have that much going on: a Windows all-in-one box would probably be fine.
Small shops that have no Linux experience. Again, with maybe up to a 50 GB/day limit and no replication requirements.
Places with no real IT staff, just one person on site who can take care of the few day-to-day things.
Wait, I see the common thread: very small places (data-wise) with little to no Linux experience.
It is still a Windows "best practice" to reboot monthly (if not more often). I have seen Linux indexers with an uptime of YEARS. Who can afford monthly Indexer downtime just so the host OS doesn't crash?
I'm not disputing that this was a problem ages ago, nor am I in any way suggesting running Splunk on Windows, but I think this problem is long past.
I can say without any reservation that you can easily get years of uptime on Server 2008 and newer. Of course, if you patch (which applies to both Linux and Windows), you will be rebooting at least occasionally.