We have deployed a Job Scheduler, Indexer, Search Head and Forwarder on virtual machines. We often see issues such as:
1. The indexer is down; unable to distribute to peer.
2. Crash logs on the indexer.
3. Splunk stops running on the Job Scheduler node.
4. Many Splunk helper processes running (the number of PIDs rises sharply to 15 and then fluctuates between 5 and 10).
We did not have these issues earlier. Recently the network has been slow, and sometimes it is unresponsive for a couple of minutes (it takes more than a minute to establish a connection to a VM).
What kinds of issues can a slow network cause in a Splunk system? Could the issues I have listed be caused by the slow network?
There is a Splunk App For Boundary that may be useful...
Boundary monitors all of the network flows between all of the VMs. It's useful for identifying whether a hotspot is in the app, the VM or the network.
OK, there are a few pointers to go through here:
Every time you run a search, Splunk spawns a new splunkd process, and each process will consume a CPU core. From what you've said, you need to be sure that you can support 15 or more concurrent instances. The spikes are most likely caused by users running searches on top of any scheduled searches you may have.
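To put rough numbers on that, here is a minimal Python sketch of the core arithmetic. The function name, the 8-core VM and the one core reserved for the OS are all made-up assumptions for illustration; the only figure taken from the question is the peak of 15 processes.

```python
def cores_short(concurrent_searches: int, cpu_cores: int, reserved_cores: int = 1) -> int:
    """Return how many extra cores would be needed, or 0 if the box has enough.

    Assumes the rule of thumb above: one search process keeps one core busy.
    reserved_cores is a hypothetical allowance for the OS and indexing work.
    """
    available = max(cpu_cores - reserved_cores, 0)
    return max(concurrent_searches - available, 0)


# Hypothetical 8-core VM facing the peak of 15 processes from the question.
print(cores_short(15, 8))  # prints 8
```

If the result is above zero at peak, searches are queuing for CPU, and it may be worth staggering the scheduled searches or adding cores.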
The old adage that running Splunk on VMs is bad is a little out of date nowadays. If deployed correctly, you won't experience too many issues. The key problem is ensuring that the indexer can get about 800 IOPS (really, I'd aim for 1200) on whatever storage it has; giving it native read/write access to a disk can help improve this. VMs are great for Splunk because they make the deployment of Search Heads quite simple: just ensure each one has some IO and then load up the CPUs.
Finally, are these all on the same host? How many network ports does the box have? If it's a single port and you've got data flowing over it into the indexer, perhaps TCP ACKs coming back out, users connecting to run searches and so on, it's quite possible you're running it near capacity. Have you checked any of the box's performance metrics?
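As a rough way to check the "near capacity" question, the sketch below turns a byte count observed over an interval into a percentage of link speed. The function name and the example figures (6 GB in 60 seconds on a 1 Gbps port) are hypothetical; substitute the counters your hypervisor or OS reports.

```python
def link_utilization(bytes_transferred: int, seconds: float, link_mbps: float) -> float:
    """Return the percentage of link capacity used over the interval."""
    if seconds <= 0 or link_mbps <= 0:
        raise ValueError("seconds and link_mbps must be positive")
    bits_per_second = bytes_transferred * 8 / seconds
    return 100.0 * bits_per_second / (link_mbps * 1_000_000)


# Hypothetical sample: 6 GB moved in 60 seconds over a 1 Gbps port.
print(f"{link_utilization(6_000_000_000, 60, 1000):.1f}%")  # prints 80.0%
```

A sustained figure in that range on a shared port would leave little headroom for forwarder traffic and search requests at the same time.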
Yes, they are all on the same ESX server. The ESX box has 2 ports. When I checked the performance metrics of the box, usage was around 20% most of the time. Sometimes there is a steep rise, and memory usage reaches 98.77%. That is when I see crash logs on the indexer.
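One way to confirm that the crashes line up with the memory spikes is to compare timestamps. This is only a sketch: the function name, the 90% threshold, the 5-minute window and the sample data are assumptions, and the real timestamps would have to come from the ESX metrics and the crash log files.

```python
from datetime import datetime, timedelta


def crashes_near_spikes(samples, crash_times, threshold=90.0,
                        window=timedelta(minutes=5)):
    """Return the crash times that fall within `window` of a memory spike.

    samples: list of (timestamp, memory_percent) pairs.
    crash_times: list of timestamps taken from the crash logs.
    """
    spikes = [ts for ts, pct in samples if pct >= threshold]
    return [c for c in crash_times
            if any(abs(c - s) <= window for s in spikes)]


# Hypothetical data: one spike at 10:30 and two crashes.
samples = [
    (datetime(2014, 1, 1, 10, 0), 20.0),
    (datetime(2014, 1, 1, 10, 30), 98.77),
    (datetime(2014, 1, 1, 11, 0), 21.5),
]
crashes = [datetime(2014, 1, 1, 10, 32), datetime(2014, 1, 1, 14, 0)]
print(crashes_near_spikes(samples, crashes))
```

With this sample data only the 10:32 crash is returned, since it is the only one within five minutes of a reading above the threshold.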