I have inherited a PoC Splunk Search Head on Windows Server 2016 that uses a Splunk Indexer cluster as its search peers. Unfortunately, rebuilding it from the ground up is not an option.
The Splunk SH continually crashes after midnight with the following error from splunkd.log:
```
05-26-2018 00:03:36.425 +1000 WARN Thread - DistributedPeerMonitorThread: about to throw a ThreadException: _beginthreadex: The paging file is too small for this operation to complete.; 68 threads active
05-26-2018 00:03:39.894 +1000 WARN Thread - indexerPipe: about to throw a ThreadException: _beginthreadex: The paging file is too small for this operation to complete.; 68 threads active
05-26-2018 00:03:39.894 +1000 ERROR pipeline - Runtime exception in pipeline=indexerPipe processor=indexer error='indexerPipe: about to throw a ThreadException: _beginthreadex: The paging file is too small for this operation to complete.; 68 threads active' confkey=''
```
The crash log contains the following snippet:
Crashing thread: DistributedPeerMonitorThread
I have a bunch (10 or so) of searches that kick off at midnight, but I have given them a schedule window of six hours in which to run. The paging file is currently set to 12 GB, and the Search Head has 24 GB of RAM.
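For context, the schedule window is set per saved search in savedsearches.conf, along these lines (the stanza name here is a placeholder; `schedule_window` is in minutes):

```
[My Midnight Report]
enableSched = 1
cron_schedule = 0 0 * * *
schedule_window = 360
```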
Any ideas on how I can make this PoC a success?
Well, a few thoughts:
You might be hitting one of the known bugs SPL-154138, SPL-154544, SPL-154542, or FAST-9662, which you can read about in the release notes.
Then again, maybe it's something else. SPL is a language with many ways to do a single thing, but not all of them are created equal. There are general rules of thumb, like "avoid join like the plague unless you absolutely cannot avoid it," but sometimes it's more subtle and takes a little thinking and experimenting to avoid certain problems. Often, if your searches were created by your SQL people... well. SQL is great, but SPL is not SQL, and if you do SPL things the way you would do SQL things, you will have a bad time.
In any case, here's what I'd do in your case.
1) Launch each of those searches manually, one by one, watching the run time, CPU, and memory consumption of each. One or more of them is the culprit, I'm pretty sure. Note the run time and memory of each; we may want that information later. (For memory, just basic numbers are needed, without a lot of precision; nothing beyond "Wow, that one sucked up 8 GB according to Windows!")
2) If none of those searches by themselves makes the system misbehave, run five of them at a time to see if you can break it with half of them, then try the other five; some combination may be responsible.
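If watching Task Manager is fiddly, Splunk's own introspection data can report per-search memory after the fact. A sketch, assuming the `_introspection` index is populated on this instance (it is by default on recent Splunk versions) and that the usual `splunk_resource_usage` fields are present:

```
index=_introspection sourcetype=splunk_resource_usage component=PerProcess data.search_props.sid=*
| stats max(data.mem_used) AS peak_mem_mb max(data.elapsed) AS elapsed_s
    BY data.search_props.sid data.search_props.app data.search_props.user
| sort - peak_mem_mb
```

Run it over the midnight-to-6-AM window and the memory hogs should float to the top.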
What to do:
Well, first off, can you add more RAM? You can stick hundreds of GB into a Splunk server and it only ever helps; Splunk's minimum recommendation is marginal at best. But I don't think adding RAM is the right solution here; I include it only for completeness' sake.
If you find ONE search that breaks things, post it here and we can see what may be done to fix its inefficiencies! I'm guessing your search does too much work on the SH side and can't distribute the work out to the Indexers. There are so many reasons this could be the case that I can't list them all here, but seeing the search involved will certainly let us know.
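As a purely illustrative example of SH-side versus distributed work (index, sourcetype, and field names here are made up), a pattern like this pulls both result sets back to the search head before combining them:

```
index=web sourcetype=access_combined
| join user [ search index=auth sourcetype=logins ]
```

while an equivalent stats-based pattern lets the indexers do the heavy lifting in parallel and ship back only partial results:

```
(index=web sourcetype=access_combined) OR (index=auth sourcetype=logins)
| stats values(sourcetype) AS sources BY user
```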
If it only breaks in a combination, maybe schedule the searches to run at different times, with enough space between them that they don't overlap. You did write down the run times of each search when testing before, didn't you? 🙂 If the UI version of scheduling doesn't give you good enough options for this, you can use cron schedules, like `15 2 * * *`, which says to run at 2:15 AM every day. Or, even if it only breaks in combination, I'm sure there are one or two searches that are the real culprits, so note which ones cause huge amounts of RAM consumption; maybe just fixing those two or three searches solves the whole problem.
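For example, staggering the searches across the window might look like this in savedsearches.conf (stanza names are placeholders; space the minutes out based on the run times you recorded):

```
[report_01]
cron_schedule = 0 0 * * *

[report_02]
cron_schedule = 30 0 * * *

[report_03]
cron_schedule = 0 1 * * *
```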
Let us know how this goes,