We just purchased Splunk and decided to roll it out into a virtual environment. As we rolled forwarders out to our servers, once we hit about 300 servers we saw latency on our NetApp SANs spike every 40-45 minutes. Since the Splunk rollout was one of the recent changes, as part of our troubleshooting we turned off the virtual NIC and the spikes stopped. Does anyone have any idea what process of Splunk would cause this? I originally thought it was new devices checking in, but the spikes continued for a week after the devices were onboarded.
Interesting issue. I'm assuming that by turning off the "virtual nic" you stopped the forwarders from talking to the Splunk indexer(s) and the problem went away. The Splunk servers are going to add some load to the storage and each forwarder will hit the disk just a little bit (reading log files and writing its own logging and state info), but an interval of 40-45 minutes doesn't really stand out as a Splunk activity. Questions that come to mind:
Is the Splunk infrastructure using the same storage as the 300 servers with forwarders installed?
Do the latency spikes appear to correspond to events on your systems that might result in a storm of indexing activity?
What OS do the forwarders run and what are they configured to index?
Did the Splunk server work OK in terms of running searches while this was going on?
If possible, I would probably try to get a fraction of the forwarders sending data again (maybe the ones that you'd expect to be the busiest) and then watch the disk activity (as reported by your VM infrastructure) for the forwarding and Splunk infrastructure hosts. The interesting part is the 40-45 minute periodic activity and my goal would be to find out where that's coming from; I'm guessing it will appear to at least a small extent with fewer forwarders enabled.
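If you're digging into that 40-45 minute periodicity, one quick sanity check (a rough sketch, not anything Splunk-specific — the data format and threshold here are made-up assumptions) is to export per-minute latency samples from your monitoring and measure the gaps between spikes, so you can confirm the interval is really regular before blaming any one scheduled job:

```python
# Hypothetical sketch: given per-minute latency samples, measure the gaps
# between spikes to confirm (or rule out) a regular ~40-45 minute cycle.
# The threshold and data shape are assumptions, not from the original post.

def spike_intervals(samples, threshold):
    """samples: list of (minute, latency_ms) tuples.
    Returns the gaps (in minutes) between consecutive above-threshold samples."""
    spike_minutes = [t for t, v in samples if v >= threshold]
    return [b - a for a, b in zip(spike_minutes, spike_minutes[1:])]

# Synthetic example: 5 ms baseline with a 50 ms spike every 42 minutes.
samples = [(m, 50 if m % 42 == 0 and m > 0 else 5) for m in range(1, 241)]
print(spike_intervals(samples, threshold=20))  # -> [42, 42, 42, 42]
```

If the gaps cluster tightly around one value, you're looking for a scheduled job with that interval; if they wander, it's more likely load-driven.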
Thanks esix! We did do that. My NetApp team told me afterward that it resolved the issue, but it actually didn't: after I left, the problem continued even with Splunk effectively shut off, and it turned out to be another system, luckily. When they first told me it was Splunk it made absolutely no sense, but it was one of the only global system changes that had occurred. We figured it out in the end. Thanks for responding though!