What would cause a blocked queue on an ec2 host and is there any tuning that can solve and prevent this issue?

sjovang · 06-09-2015

We have ~50 hosts placed at various locations outside our data center. To receive logs from these hosts, we have set up a virtual machine on EC2 to relay the logs to our Splunk Platform.

From time to time we see the amount of indexed data drop (the number of events stays more or less the same, since the bulk of the data is perfmon events from Windows servers). When this happens, metrics.log on the EC2 host shows that some of the queues are blocked:

05-28-2015 21:00:52.585 +0200 INFO  Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1845, largest_size=1845, smallest_size=1845
05-28-2015 21:00:52.585 +0200 INFO  Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1848, largest_size=1848, smallest_size=1848
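
To spot these lines without reading metrics.log by hand, something like the following could help. This is only a minimal sketch in Python, not a Splunk tool: the function names `parse_queue_line` and `blocked_queues` are made up for illustration, and it assumes every queue line uses the same comma-separated key=value layout as the two lines above.

```python
import re

# Matches key=value pairs such as name=indexqueue or blocked=true.
KV_PATTERN = re.compile(r"(\w+)=([^,\s]+)")


def parse_queue_line(line):
    """Return a dict of fields for a group=queue metrics line, else None."""
    fields = dict(KV_PATTERN.findall(line))
    if fields.get("group") != "queue":
        return None
    return fields


def blocked_queues(lines):
    """Yield (queue name, fill ratio) for each line reporting blocked=true."""
    for line in lines:
        fields = parse_queue_line(line)
        # Lines without a blocked field are treated as not blocked.
        if fields is None or fields.get("blocked") != "true":
            continue
        ratio = int(fields["current_size_kb"]) / int(fields["max_size_kb"])
        yield fields["name"], ratio


# The two lines quoted above, used as sample input.
sample = [
    "05-28-2015 21:00:52.585 +0200 INFO  Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1845, largest_size=1845, smallest_size=1845",
    "05-28-2015 21:00:52.585 +0200 INFO  Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=1848, largest_size=1848, smallest_size=1848",
]

results = list(blocked_queues(sample))
for name, ratio in results:
    print(f"{name}: {ratio:.1%} full")
```

For the two sample lines this reports both queues at 99.8% of their 500 KB limit. Pointing it at the real file (for example by passing an open file handle to `blocked_queues`) would show which queue fills up first, which may hint at where the bottleneck sits.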

Restarting Splunk solves the issue, but it returns after a random number of days.

I'm trying to grasp how the queues work from http://wiki.splunk.com/Community:HowIndexingWorks, and as far as I understand, the indexqueue is what writes data to disk?

Even if we're "just" relaying data through the EC2 host, does all the log data pass through each queue, get written to disk, and then get forwarded?

Because the issue is not constant but appears at random times, I suspect the root cause might be problems with EC2 (Amazon doing maintenance without our knowledge, high load on the underlying EBS volumes, etc.) degrading performance. Am I on the right track here, or are there other reasons that are more likely?

Is there any tuning that could be done to avoid these issues, or is it just a "throw more/better hw at it" problem?

We would really appreciate some feedback. We want to start indexing our IIS logs as well. This will significantly increase the volume of events indexed, but we can't enable it until we're sure our architecture is stable.

esix_splunk · 06-09-2015

So a few things are happening:

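The max_size_kb=500 in those lines is the maximum size of each queue (500 KB), and both queues are sitting at 499 KB, so they are full. Separately, your forwarder / EC2 instance may be limiting its output through maxKBps in limits.conf. That isn't a bad thing, but depending on your traffic volumes and network topology, it may need to be tuned.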

Your typing and indexing queues are getting blocked. This can mean that you are I/O constrained on the indexer, and/or that you have some bad regexes or a lot of regexes running.
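
If you suspect the regexes, one rough way to compare candidates is to time them against a handful of sample events. The sketch below is only an illustration in Python: the events, the field name, and both patterns are hypothetical, and Python's `re` engine is not the PCRE engine Splunk uses, so treat the numbers as relative hints rather than what the typing pipeline will actually cost.

```python
import re
import time


def seconds_per_event(pattern, events, repeats=10):
    """Average wall-clock seconds one search of `pattern` takes per event."""
    compiled = re.compile(pattern)
    start = time.perf_counter()
    for _ in range(repeats):
        for event in events:
            compiled.search(event)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(events))


# Hypothetical sample events and patterns, for illustration only.
events = ["CPU Load Percentage=42 host=web01 " + "x" * 100] * 20
patterns = {
    # Anchored at the start of the event, so it matches or fails quickly.
    "anchored": r"^CPU Load Percentage=(\d+)",
    # Leading wildcard plus a literal that never appears forces backtracking.
    "leading_wildcard": r".*Percentage=(\d+).*missing",
}

timings = {name: seconds_per_event(p, events) for name, p in patterns.items()}
for name, cost in sorted(timings.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {cost * 1e6:.1f} microseconds per event")
```

Patterns that start with a wildcard, or that fail only after scanning the whole event, tend to come out noticeably more expensive than anchored ones, and those are the first candidates to rewrite or remove.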

You should look at adding more indexers or more heavy forwarders (HF).
