I have a stand-alone deployment server set up on a CentOS 7 Linux VM with 8 cores and 8 GB RAM, running Splunk 6.2.8. This server is currently managing about 150 clients, and with this setup I cannot imagine scaling beyond 500 clients.
There are a few (about 30 clients) set up with deploymentclient.conf on the default phoneHomeIntervalInSecs, but the rest (after we realized) are set up with an interval of 600 (10 minutes).
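For a rough sense of scale, here is a minimal sketch of the phone-home arithmetic for the numbers above. It assumes the default phoneHomeIntervalInSecs is 60 seconds and that the "rest" is roughly 120 clients; the function name is made up for illustration.

```python
# Rough phone-home arithmetic for the setup described in the post.
# Assumptions: about 30 clients on the default interval (taken to be 60 s)
# and about 120 clients on the 600 s interval.

def checkins_per_minute(clients: int, interval_secs: int) -> float:
    """Average phone-home requests per minute for a group of clients."""
    return clients * 60 / interval_secs

default_group = checkins_per_minute(30, 60)   # clients still on the default
tuned_group = checkins_per_minute(120, 600)   # clients set to 600 seconds

print(default_group, tuned_group, default_group + tuned_group)
```

This only gives an average rate. It says nothing about how the check-ins are distributed within each 10-minute window.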
The server is behaving strangely though. Every 10 minutes or so (imagine that), the CPU usage spikes considerably, and I get load averages ranging from 12-20. This lasts for about 5 minutes and then everything dies down and becomes normal for a few minutes.
Okay, so things I have tried:
The only thing that I can think of doing wrong is maybe the DMC setup? Other than that, the splunkd.log basically has a bunch of entries for broken pipes around every 10 minutes (which correlates with the problem) and then it chugs along happily for the next 5 minutes.
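One way to check that correlation is to count the broken-pipe entries per minute. The following is a sketch only: it assumes each splunkd.log line starts with a timestamp of the form MM-DD-YYYY HH:MM:SS.mmm and that the entries contain the text "broken pipe" in some capitalization. The function name is hypothetical.

```python
import re
from collections import Counter

# Assumed line prefix: "MM-DD-YYYY HH:MM:SS.mmm ..." (adjust if your log differs).
MINUTE_PREFIX = re.compile(r"^(\d{2}-\d{2}-\d{4} \d{2}:\d{2}):")

def broken_pipes_per_minute(lines):
    """Count lines mentioning 'broken pipe', keyed by their timestamp minute."""
    counts = Counter()
    for line in lines:
        if "broken pipe" not in line.lower():
            continue
        match = MINUTE_PREFIX.match(line)
        if match:  # lines without the assumed timestamp prefix are skipped
            counts[match.group(1)] += 1
    return counts

# Example use (the path depends on your installation):
# with open("/opt/splunk/var/log/splunk/splunkd.log") as log:
#     for minute, count in sorted(broken_pipes_per_minute(log).items()):
#         print(minute, count)
```

If the counts cluster at roughly 10-minute intervals, that lines up with the 600-second phone-home setting.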
Surely I'm missing something silly.
Maybe I should be disabling DMC completely on this? How do I go about doing that?
I've now hit a load average of 32, with one of my splunkd processes consuming 400% CPU... I'm really not sure what is going on...
Do you have the NIX app installed? If so, make sure that it excludes the bash history, as it will iterate over it again and again, causing the CPU to spike.
I don't think the issue is related to DMC, because DMC only runs ad-hoc searches when you open the DMC dashboards or the DMC overview page. If you don't use DMC, it should not consume any resources.
DMC contains some alerts, which are disabled by default, so those shouldn't consume any resources either.
Do you have scheduled searches or alerts running on that Splunk instance? You can verify that using the "Search Activity: Instance" dashboard in DMC; there's a panel in this dashboard showing the search concurrency over time.
Right, so this is a lesson in humility.
After doing some tests and giving this machine as many resources as possible, I went back to basic questions.
1) Why is THIS VM guest slow and the others are not?
2) Why is THIS VM guest struggling when the overall performance of the VM host is not really being tested?
Now when you ask those questions, you start looking away from the internals of your VM guest and you start looking at your VM host setup.
Somewhere in the past, a low resource reservation pool was created, and this particular VM was placed in that pool. In effect, it didn't matter how many cores or how much memory I assigned to it; it would never fully use the given hardware. We increased the reserved resources. The VM guest is now chugging along happily with a single core, 1 GB of RAM, and 250+ clients deployed to it.
Thank you all for the input. Maybe this could serve as a lesson to other people having 'issues with Splunk' on a VM.