We have ~50 hosts placed at various locations outside our data center. To receive logs from these hosts, we have set up a virtual machine on EC2 to relay the logs to our Splunk platform.
From time to time we see that the amount of indexed data drops (the number of events stays more or less the same, since the bulk of the data is perfmon events from Windows Servers). When this happens we can see in metrics.log (on the EC2 host) that some of the queues are blocked.
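For reference, the blocked-queue entries in metrics.log look roughly like the line below (the timestamp, queue name, and sizes are placeholders, not our actual output):

    01-01-2014 12:00:00.000 +0000 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1200, largest_size=1300, smallest_size=0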
Even though we're "just" relaying data through the EC2 host, does all the log data pass through each queue, get written to disk, and then get forwarded?
Because the issue is not constant but appears at random times, I suspect the root cause might be problems with EC2 (Amazon doing maintenance without our knowledge, high load on the underlying EBS volumes, etc.) degrading performance. Am I on the right track here, or are there other, more likely explanations?
Is there any tuning that could be done to avoid these issues, or is it just a "throw more/better hardware at it" problem?
We would really appreciate some feedback. We also want to start indexing our IIS logs, which will significantly increase the volume of events indexed, but we can't enable that until we're sure our architecture is stable.
maxKBps is limited to 500, which means your forwarder / EC2 instance is limiting its output to 500 KB/s. This isn't a bad thing, but depending on your traffic volume and network topology, it may need to be tuned.
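As a minimal sketch, raising (or removing) that limit is done in limits.conf on the forwarder / EC2 relay; the value below is just an illustration, pick a cap that matches your available bandwidth:

    # limits.conf on the intermediate forwarder
    [thruput]
    # 0 = unlimited; or set a higher cap in KB/s (e.g. 1024)
    maxKBps = 0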
Your typing and indexing queues are getting blocked. This can mean you are I/O constrained on the indexer, and/or you have some bad regexes (or a lot of regexes) running.
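One way to see which queues are blocking and when is a search against the internal index; this is a generic sketch, so restrict it by host or splunk_server to match your own environment:

    index=_internal source=*metrics.log group=queue blocked=true
    | timechart count by name

If the blocked counts line up with the drops in indexed volume, that points at the downstream bottleneck (indexer I/O or parsing load) rather than the EC2 relay itself.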