Solved: How to troubleshoot why my deployment server CPU spikes every 10 minutes and slows down?

tiny3001 · 02-18-2016

Hi everyone,

I have a stand-alone deployment server set up on a CentOS 7 Linux VM with 8 cores and 8 GB of RAM, running Splunk 6.2.8. This server currently manages about 150 clients, and with this setup I cannot imagine scaling beyond 500 clients.

About 30 clients are still on the default phoneHomeIntervalInSecs in deploymentclient.conf, but the rest (changed after we realized) use an interval of 600 seconds (10 minutes).
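As a rough sanity check on the phone-home load these settings imply, the arithmetic can be sketched as below. The client counts come from the post; the 60-second default interval and the `phone_home_rate` helper are assumptions made for illustration.

```python
def phone_home_rate(groups):
    """Average check-ins per second for (client_count, interval_secs) groups."""
    return sum(count / interval for count, interval in groups)

# About 30 clients on the default interval (assumed to be 60 s here),
# and the remaining ~120 of ~150 clients on 600 s.
groups = [(30, 60), (120, 600)]

rate = phone_home_rate(groups)
print(f"average load: {rate:.2f} check-ins per second")

# Worst case: every client happens to phone home in the same instant.
burst = sum(count for count, _ in groups)
print(f"worst-case simultaneous check-ins: {burst}")
```

Under these assumptions the average works out to well under one check-in per second, which suggests the client count alone should not saturate an 8-core machine.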

The server is behaving strangely, though. Every 10 minutes or so (imagine that), the CPU usage spikes considerably, and I get load averages ranging from 12 to 20. This lasts for about 5 minutes, and then everything dies down and is normal for a few minutes.

Okay, so things I have tried:

1) Set ulimit -n to 8192 (confirmed for the splunk user).
2) Configured a DNS server, per the known issue for 6.2.0: "Splunk Web becomes unreachable if an enabled deployment server in the same instance cannot access DNS" (SPL-28471).
3) Disabled the "Deployment server" role for this instance in the Distributed Management Console, even though I'm running in standalone mode. Only "Indexer" and "Search Head" are selected. This was done based on the recommendation on this page: "Do not host a distributed management console, which is essentially a search head, on a deployment server with more than 50 clients."
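One way to confirm the open-files limit that a running process actually got (rather than the limit of a login shell) is to read `/proc/<pid>/limits` on Linux. The sketch below is a minimal illustration; the helper names and the sample text are hypothetical, and the splunkd PID has to be looked up separately.

```python
def max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' values from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns after the three-word name: soft limit, hard limit, units.
            soft, hard = line.split()[3:5]
            return soft, hard
    return None


def read_limits(pid):
    """Read the limits of a live process; pid is whatever PID splunkd has."""
    with open(f"/proc/{pid}/limits") as f:
        return max_open_files(f.read())


# Made-up sample in the layout the Linux kernel uses for this file.
SAMPLE = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max processes             4096                 4096                 processes\n"
    "Max open files            8192                 8192                 files\n"
)
print(max_open_files(SAMPLE))
```

The values are returned as strings because the kernel may report `unlimited` instead of a number.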

The only thing I can think of that I might be doing wrong is the DMC setup. Other than that, splunkd.log has a bunch of broken-pipe entries roughly every 10 minutes (which correlates with the problem), and then it chugs along happily for the next 5 minutes.
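To check whether the broken-pipe entries really line up with the 10-minute cycle, the log lines can be counted per time bucket. This is a minimal sketch, assuming each splunkd.log line starts with a timestamp in the form `MM-DD-YYYY HH:MM:SS.mmm`; the sample lines and the component name in them are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def broken_pipe_buckets(lines, bucket_minutes=10):
    """Count lines mentioning 'broken pipe' per bucket_minutes-wide time bucket."""
    counts = Counter()
    for line in lines:
        if "broken pipe" not in line.lower():
            continue
        try:
            # The first 23 characters are assumed to hold the timestamp.
            ts = datetime.strptime(line[:23], "%m-%d-%Y %H:%M:%S.%f")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        minute = ts.minute - ts.minute % bucket_minutes
        counts[ts.replace(minute=minute, second=0, microsecond=0)] += 1
    return counts


# Hypothetical sample lines; only the timestamp layout matters here.
sample = [
    "02-18-2016 10:00:03.120 +0200 WARN  ExampleComponent - Broken pipe",
    "02-18-2016 10:01:10.455 +0200 WARN  ExampleComponent - Broken pipe",
    "02-18-2016 10:04:00.000 +0200 INFO  ExampleComponent - all good",
    "02-18-2016 10:10:02.001 +0200 WARN  ExampleComponent - Broken pipe",
]
counts = broken_pipe_buckets(sample)
for bucket in sorted(counts):
    print(bucket, counts[bucket])
```

For a real log, pass the open file object instead of `sample`. A count that peaks in every bucket would support the link to the phone-home interval.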

Surely I'm missing something silly.

Please help?

tiny3001 · 02-18-2016

Right, so this is a lesson in humility.

After doing some tests and giving this machine as many resources as possible, I went back to basic questions.

1) Why is THIS VM guest slow and the others are not?
2) Why is THIS VM guest struggling when the overall performance of the VM host is not really being tested?

Now when you ask those questions, you start looking away from the internals of your VM guest and you start looking at your VM host setup.

At some point in the past, a resource pool with a low reservation was created, and this particular VM was placed in that pool. In effect, it didn't matter how many cores or how much memory I assigned to it; it would never fully use the given hardware. We increased the reserved resources, and the VM guest is now chugging along happily with a single core, 1 GB of RAM, and 250+ clients deployed to it.
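One in-guest signal that can hint at this kind of host-side contention is CPU steal time in `/proc/stat`. The sketch below compares two samples of the aggregate `cpu` line; the helper names and sample values are hypothetical. This is an assumption-laden check: the post does not say which hypervisor was used, not every hypervisor reports steal time to the guest, and a resource-pool limit may not show up here at all.

```python
def cpu_times(stat_text):
    """Return the first 8 counters of the aggregate 'cpu' line in /proc/stat.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    """
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(v) for v in line.split()[1:9]]
    raise ValueError("no aggregate cpu line found")


def steal_percent(before, after):
    """Share of CPU time stolen by the hypervisor between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0


# Made-up samples; on a real guest, read /proc/stat twice a few seconds apart.
first = cpu_times("cpu  100 0 50 800 10 0 0 40 0 0\n")
second = cpu_times("cpu  200 0 100 1200 20 0 0 480 0 0\n")
print(f"steal: {steal_percent(first, second):.1f}%")
```

A persistently high steal percentage points at the host rather than at Splunk, which is the same shift in perspective the two questions above lead to.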

Thank you all for the input. Maybe this could serve as a lesson to other people having 'issues with Splunk' on a VM.

ykou_splunk · 02-18-2016

I don't think the issue is related to DMC, because DMC only runs ad-hoc searches when you open a DMC dashboard or the DMC overview page. If you don't use DMC, it should not consume any resources.

DMC contains some alerts, but they are disabled by default, so they shouldn't consume any resources either.

Do you have scheduled searches or alerts running on that Splunk instance? You can verify that with the "Search Activity: Instance" dashboard in DMC, which has a panel showing search concurrency over time.

dbcase · 02-18-2016

Do you have the NIX app installed? If so, make sure that it excludes the bash history, as it will iterate over it again and again, causing the CPU to spike.

tiny3001 · 02-18-2016

Maybe I should disable DMC completely on this server? How do I go about doing that?

tiny3001 · 02-18-2016

I've now hit a load average of 32, with one of my splunkd processes consuming 400% CPU... I'm really not sure what is going on...
