Getting Data In

How to troubleshoot blocked queues that are preventing data from being indexed?

arber
Communicator

Hello, we are currently having a problem with the Splunk system we use. We collect data from clients via syslog: the clients send data to the Splunk system over syslog, and Splunk then reads the contents of the folder where the data is stored. Today the system stopped indexing. We can see logs still arriving in the folder that Splunk reads, but they do not show up in searches.
Searching splunkd.log, I found this:

09-17-2014 16:15:44.357 +0200 INFO  BatchReader - Could not send data to output queue (parsingQueue), retrying...

also in metrics.log

09-17-2014 16:03:33.811 +0200 INFO  Metrics - group=queue, name=splunktcpin, blocked=true, max_size_kb=500, current_size_kb=499, current_size=661, largest_size=1081, smallest_size=0
09-17-2014 16:03:33.811 +0200 INFO  Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=874, largest_size=1399, smallest_size=0
09-17-2014 16:04:10.812 +0200 INFO  Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023, current_size=2826, largest_size=2855, smallest_size=735
09-17-2014 16:05:14.809 +0200 INFO  Metrics - group=queue, name=splunktcpin, blocked=true, max_size_kb=500, current_size_kb=499, current_size=739, largest_size=813, smallest_size=0
09-17-2014 16:06:16.811 +0200 INFO  Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=659, largest_si"

Any idea how to get the queues unblocked?

Regards

Arber

ltrand
Contributor

On the UF you can manage queue sizes in $SPLUNK_HOME/etc/system/local/server.conf. To increase the parsing queue, for example:

[queue=parsingQueue]
maxSize = 500KB   # the default size
# maxSize = 10MB  # a reasonable size if watching a DNS server
# maxSize = 0     # unthrottled forwarding -- if you are crazy; USE WITH CARE

I would suggest identifying the servers that need this and defining them as a server class, so you can easily manage which hosts have this setting and which have the default.
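A minimal sketch of what that server class could look like in serverclass.conf on a deployment server. The class name, app name, and host pattern here are hypothetical placeholders; the app itself would carry the server.conf stanza above in its local/ directory:

```ini
# serverclass.conf on the deployment server (names are illustrative)
[serverClass:big_queue_hosts]
whitelist.0 = syslog-collector*.example.com

# push a hypothetical app "bigqueue_tuning" containing local/server.conf
[serverClass:big_queue_hosts:app:bigqueue_tuning]
restartSplunkd = true
```

Hosts not matched by the whitelist keep the default queue sizes.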

arber
Communicator

I changed the queue sizes: I raised parsingQueue from 6 MB to 20 MB and aggQueue from 1 MB to 20 MB, but the queues are still blocked, and indexing for that specific file has stopped.


ltrand
Contributor

Can you provide some log output from splunkd.log or metrics.log in /splunk/var/log? We had to tune this several times as well, and it was a balancing act. The queues take RAM, so if you have plenty available, feel free to crank them up. Since I don't know which logs your router is sending, or at what rate, I can't suggest a good number to hit. For reference, I have an IDS whose UF parsing queue is set to 300 MB to make it work; it creates a large volume of logs, so that's what it takes to keep up with the rate.

Think of it this way: if you're trying to drain a basin that is filling at 5 gallons/min and you can only bail 3 gallons/min, you'll never keep up. So when Splunk is bailing out the logs, it needs to do so at or above the incoming rate.
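One way to watch that balance is Splunk's own metrics.log, which is indexed into _internal. The group=queue events carry the same current_size_kb and max_size_kb fields shown in the excerpts above, so a sketch like this charts how full each queue runs over time (default index and source names assumed):

```spl
index=_internal source=*metrics.log group=queue
| eval fill_pct = round(current_size_kb / max_size_kb * 100, 1)
| timechart span=1m max(fill_pct) by name
```

Queues pinned near 100% are the ones that can't drain; the group=per_index_thruput events in the same log give the indexing rate to compare against.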


arber
Communicator

metrics.log

09-25-2014 16:09:12.905 +0200 INFO Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=6144, current_size_kb=6143, current_size=4790, largest_size=6126, smallest_size=3166
09-25-2014 16:09:12.905 +0200 INFO Metrics - group=queue, name=splunktcpin, blocked=true, max_size_kb=500, current_size_kb=499, current_size=615, largest_size=845, smallest_size=0
09-25-2014 16:09:12.905 +0200 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=644, largest_size=781, smallest_size=0
09-25-2014 16:09:56.924 +0200 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023, current_size=1596, largest_size=1816, smallest_size=637
09-25-2014 16:10:36.903 +0200 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023, current_size=1516, largest_si

I tried increasing the size this time to 100 MB, so I made aggQueue 100 MB (even though the log shows it as 1 MB), but I would still get something like: INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=102400, current_size_kb=102300, current_size=1036648, largest_si

Still the same thing: the queues are blocked. What is strange is that this is happening with just this one file. It was fine before, and I'm fairly sure the log volume from this host has not changed.


jagadeeshm
Contributor

So how did you eventually resolve this issue?


arber
Communicator

We created another folder on the Splunk server that these devices send their logs to, then applied a monitor input to the new folder. That seemed to work.
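For reference, a monitor input like the one described would be a short inputs.conf stanza. The path and sourcetype below are hypothetical examples, not the poster's actual values:

```ini
# inputs.conf on the instance reading the files (path is illustrative)
[monitor:///data/syslog-juniper]
sourcetype = syslog
disabled = false
```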


arber
Communicator

The problem is that the device is a Juniper router, so it's not possible to install the UF. When I search with the SoS app, I can see that this log file is in a status of ignored (reading batch file), while all the other files in the monitored folder are in a status of reading.


ltrand
Contributor

It will ignore files that overload its buffer in an effort to preserve logging for the rest. server.conf and limits.conf should be edited at whatever point Splunk touches the data. So if syslog is receiving the data and writing it to file on the indexer, and the indexer is monitoring those local files, then that is where you change it.
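If a throughput cap is in play at the instance reading the files, it lives in the [thruput] stanza of limits.conf. A sketch, assuming you want to lift the cap entirely (universal forwarders ship with a low default cap, commonly 256 KBps, while full instances are typically uncapped):

```ini
# limits.conf at whatever instance is reading the files
[thruput]
maxKBps = 0   # 0 = unlimited; set a concrete value to throttle instead
```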


linu1988
Champion

I have seen errors like this on Splunk 5 indexers. Mostly, restarting the services would let it continue without any issue. It may be due to all the network ports getting used up?


martin_mueller
SplunkTrust

Go dig for messages about the actual reason, though, preferably from the time indexing first stopped. Those queues are only a symptom.

Alternatively, take a look at http://wiki.splunk.com/Community:HowIndexingWorks and check whether queues further down the pipeline are suddenly not blocked. If so, the processor after the bottom-most blocked queue is likely to blame.
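To spot the bottom-most blocked queue, a quick sketch against the default _internal index counts how often each queue reports blocked=true (the group=queue and blocked fields are visible in the metrics.log excerpts above):

```spl
index=_internal source=*metrics.log group=queue blocked=true
| stats count by name
| sort - count
```

The blocked queue furthest down the pipeline order (parsing, agg, typing, index) points at the processor that is actually stuck.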


arber
Communicator

No, disk space is OK. Indexing has stopped only for the clients that send data via syslog.


martin_mueller
SplunkTrust

The key question is: why did indexing stop? Blocked queues are usually just a symptom of something further down the line not working properly; they're usually not the cause of anything.

Disk space would be a common issue... should be shown prominently in Splunk though.
