Splunk Enterprise

Splunk Instances Utilizing HIGH CPU Usage in VMware Environment

sokngoc
Explorer

Hi Everyone,

Any help would be appreciated. We have 4 Splunk instances that work together in tandem.

All four servers are virtual machines running Red Hat Enterprise Linux 8 with Splunk Enterprise 8.2.2. vCenter is 6.7, with four ESXi hosts each running 6.7 as well.

 

The four Splunk VMs show very high CPU usage at all times:

45.8 GHz

83.44 GHz

45.6 GHz

83.82 GHz

 

It is basically running our ESXi hosts at full capacity. I logged onto each server and ran the top -i command, and each server reports very low CPU usage.

Does anyone have any recommendations? Any help would be greatly appreciated.

 

Thank you,


isoutamo
SplunkTrust

Hi

When you run Splunk on VMware VMs, I have seen this happen in some cases where too many vCores and too much memory are reserved for individual VMs. Remember that when VMware schedules those VMs, it needs to clear memory (on the ESXi host) etc. before it starts to run the "new" VM. Especially if you have overcommitted your ESXi hosts, as is usually recommended, you have "shot yourself in the foot" ;-( with Splunk nodes.

You should try to decrease the number of vCores and the amount of memory if possible, and then check the situation. I usually try to give VMs as few resources as possible and increase them when needed.
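To get a feel for how overcommitted a host is, a rough sketch like the one below can help. It only divides the vCPUs handed out on a host by that host's physical cores. The host names and numbers are made up for illustration and are not taken from this environment.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    physical_cores: int
    vcpus_allocated: int  # sum of vCPUs over all VMs placed on this host


def overcommit_ratio(host: Host) -> float:
    """vCPUs allocated per physical core; above 1.0 means overcommitted."""
    return host.vcpus_allocated / host.physical_cores


# Hypothetical inventory, for illustration only.
hosts = [
    Host("esxi-01", physical_cores=32, vcpus_allocated=24),
    Host("esxi-02", physical_cores=32, vcpus_allocated=80),
]

for h in hosts:
    ratio = overcommit_ratio(h)
    status = "overcommitted" if ratio > 1.0 else "ok"
    print(f"{h.name}: {ratio:.2f} vCPU per core ({status})")
```

If the ratio is well above 1.0 on the hosts that carry the Splunk VMs, shrinking the VMs or moving other workloads away is probably the first thing to try.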

r. Ismo

sokngoc
Explorer

Hello,

Interesting solution. I believe it might be worth a try. If it does solve the problem I will let you know.

However, I do see one possible flaw in your solution. I may run into a situation where I cut the CPU cores in half and the VM continues to max out the resources available. Then I would have to increase them back to where they were originally.


isoutamo
SplunkTrust

Definitely something to try. Also make sure that vCores vs. sockets vs. threads are configured sensibly (whatever that means in your environment). Within one VM, switching from one socket to another, or using threads from two or more sockets, is more expensive than using threads from a single socket.
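As a rough illustration of the socket point, the sketch below checks whether a VM's vCPU count fits within the cores of one physical socket. It assumes one NUMA node per socket, which is common but not guaranteed, and the core counts are hypothetical.

```python
def fits_one_socket(vm_vcpus: int, cores_per_socket: int) -> bool:
    """True if the VM can be scheduled on the cores of a single socket."""
    return vm_vcpus <= cores_per_socket


# Hypothetical host with 16 cores per socket.
CORES_PER_SOCKET = 16

for vcpus in (8, 16, 24):
    where = "one socket" if fits_one_socket(vcpus, CORES_PER_SOCKET) else "spans sockets"
    print(f"{vcpus} vCPUs: {where}")
```

A VM that spans sockets is not wrong as such, but it is a candidate for downsizing if the workload does not need that many cores.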

Also remember that clearing memory when switching context from one VM to another is quite an expensive task. For that reason, never overcommit those resources for Splunk nodes.

And one last thing: ensure that you have enough IOPS on all nodes (especially indexers) at the same time. It's not enough that one peer gets 1200+ and another gets 200, as all of them are needed at the same time when a search starts!
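To make the IOPS point concrete, here is a small sketch that flags peers below a minimum, since the slowest peer is what matters and not the average. The threshold and the peer names are assumptions for illustration. The measured values would come from whatever disk benchmark you run on each node at the same time.

```python
def weak_peers(measured_iops: dict[str, int], min_iops: int) -> list[str]:
    """Return the names of peers whose measured IOPS fall below min_iops."""
    return sorted(name for name, iops in measured_iops.items() if iops < min_iops)


# Hypothetical threshold and measurements, taken concurrently on each node.
MIN_IOPS = 800
measured = {"idx-1": 1200, "idx-2": 200}

average = sum(measured.values()) / len(measured)
print(f"average IOPS: {average:.0f}")  # the average can look acceptable
print("below minimum:", weak_peers(measured, MIN_IOPS))
```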

If I recall correctly, there are/were some VMware white papers that go through this at a deeper level.

r. Ismo

thetech
Explorer

Hi,

Can you please describe your deployment? I.e., what roles do the servers have? How many indexers are there, and are they clustered? Are you using ES/ITSI?

Regards

theTech

sokngoc
Explorer

Hi,

Thanks for responding, only 1 indexer.

I have four servers

1A - Cluster Master

2A - Indexer

3A - Heavy Forwarder

4A - Search Head

Using ES.

Please let me know if you have any recommendations.
