We just purchased Splunk and decided to roll it out into a virtual environment. As we rolled forwarders out to our servers, once we hit about 300 servers we saw latency on our NetApp SANs spike every 40-45 minutes. Since the Splunk rollout was one of the recent changes, as part of our troubleshooting we turned off the virtual NIC and the spikes stopped. Does anyone have any idea what process of Splunk would cause this? I originally thought it was new devices checking in, but the spikes continued for a week after the devices were onboarded.
Interesting issue. I'm assuming that by turning off the "virtual nic" you stopped the forwarders from talking to the Splunk indexer(s) and the problem went away. The Splunk servers are going to add some load to the storage and each forwarder will hit the disk just a little bit (reading log files and writing its own logging and state info), but an interval of 40-45 minutes doesn't really stand out as a Splunk activity. Questions that come to mind:
Is the Splunk infrastructure using the same storage as the 300 servers with forwarders installed?
Do the latency spikes appear to correspond to events on your systems that might result in a storm of indexing activity?
What OS do the forwarders run and what are they configured to index?
Did the Splunk server work OK in terms of running searches while this was going on?
If possible, I would probably try to get a fraction of the forwarders sending data again (maybe the ones that you'd expect to be the busiest) and then watch the disk activity (as reported by your VM infrastructure) for the forwarding and Splunk infrastructure hosts. The interesting part is the 40-45 minute periodic activity and my goal would be to find out where that's coming from; I'm guessing it will appear to at least a small extent with fewer forwarders enabled.
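If you're digging into that 40-45 minute periodicity, one quick sanity check (a rough sketch, not anything Splunk-specific — the data format and threshold here are made-up assumptions) is to export per-minute latency samples from your monitoring and measure the gaps between spikes, so you can confirm the interval is really regular before blaming any one scheduled job:

```python
# Hypothetical sketch: given per-minute latency samples, measure the gaps
# between spikes to confirm (or rule out) a regular ~40-45 minute cycle.
# The threshold and data shape are assumptions, not from the original post.

def spike_intervals(samples, threshold):
    """samples: list of (minute, latency_ms) tuples.
    Returns the gaps (in minutes) between consecutive above-threshold samples."""
    spike_minutes = [t for t, v in samples if v >= threshold]
    return [b - a for a, b in zip(spike_minutes, spike_minutes[1:])]

# Synthetic example: 5 ms baseline with a 50 ms spike every 42 minutes.
samples = [(m, 50 if m % 42 == 0 and m > 0 else 5) for m in range(1, 241)]
print(spike_intervals(samples, threshold=20))  # -> [42, 42, 42, 42]
```

If the gaps cluster tightly around one value, you're looking for a scheduled job with that interval; if they wander, it's more likely load-driven.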
Thanks esix! We did do that. My NetApp team told me afterward that it resolved the issue, but it actually didn't: after I left, the problem continued even with Splunk effectively shut off, and it turned out to be another system, luckily. When they first told me it was Splunk it made absolutely no sense, but it was one of the only global system changes that had occurred. We figured it out in the end. Thanks for responding though!