Hello, I'm currently having a problem with our Splunk system. We collect data from other clients using syslog: the clients send data to the Splunk server via syslog, and Splunk then reads the contents of the folder where the data is stored. Today the system stopped indexing. We can see the logs still arriving in the folder that Splunk reads, but they do not show up in searches.
Searching splunkd.log, I found this:
09-17-2014 16:15:44.357 +0200 INFO BatchReader - Could not send data to output queue (parsingQueue), retrying...
And in metrics.log:
09-17-2014 16:03:33.811 +0200 INFO Metrics - group=queue, name=splunktcpin, blocked=true, max_size_kb=500, current_size_kb=499, current_size=661, largest_size=1081, smallest_size=0
09-17-2014 16:03:33.811 +0200 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=874, largest_size=1399, smallest_size=0
09-17-2014 16:04:10.812 +0200 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023, current_size=2826, largest_size=2855, smallest_size=735
09-17-2014 16:05:14.809 +0200 INFO Metrics - group=queue, name=splunktcpin, blocked=true, max_size_kb=500, current_size_kb=499, current_size=739, largest_size=813, smallest_size=0
09-17-2014 16:06:16.811 +0200 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=659, largest_si
Any idea how to get the queues unblocked?
On the UF you can manage queue sizes in the $SPLUNK_HOME/etc/system/local/server.conf file. To increase the parsing queue:
The default size:
[queue=parsingQueue]
maxSize = 500KB
A reasonable size if you are watching a DNS server:
[queue=parsingQueue]
maxSize = 10MB
Only if you are crazy and want to allow unthrottled forwarding. USE WITH CARE:
[queue=parsingQueue]
maxSize = 0
I would suggest identifying the servers that need this and defining them as a server class, so you can easily manage which hosts have this setting and which have the default.
I changed the queue sizes: the parsingQueue from 6 MB to 20 MB and the aggQueue from 1 MB to 20 MB. The queues are still blocked, and indexing for that specific file has stopped.
Can you provide some log output from splunkd.log or metrics.log under /splunk/var/log? We had to tune this several times as well, and it was a balancing act. The queues take RAM, so if you have plenty available, feel free to crank it up. Since I don't know which logs your router is sending or at what rate, I can't suggest a good number to aim for. For reference, I have an IDS whose UF parsing queue is set to 300MB to make it work; it creates large amounts of logs, so that's what it takes to keep up with the rate.
Think of it this way: if you're trying to drain a basin that is filling at 5 gallons/min and you can only bail 3 gallons/min, you'll never keep up. So when Splunk is bailing out the logs, it needs to do so at or above the incoming rate.
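To put rough numbers on the basin analogy, here is a minimal Python sketch. The function name and the 5 KB/s in, 3 KB/s out rates are made up for illustration and are not measured from any system in this thread.

```python
def seconds_until_full(queue_kb, in_kb_per_s, out_kb_per_s):
    """Return how many seconds an empty queue takes to fill, or None if it never fills."""
    net_kb_per_s = in_kb_per_s - out_kb_per_s
    if net_kb_per_s <= 0:
        # Draining at least as fast as filling: the queue never blocks.
        return None
    return queue_kb / net_kb_per_s


# Hypothetical rates: 5 KB/s arriving, 3 KB/s being processed.
default_queue = seconds_until_full(500, 5, 3)      # 500 KB queue
large_queue = seconds_until_full(102400, 5, 3)     # 100 MB queue

print(default_queue, large_queue)
```

Under these assumed rates a bigger queue only postpones the moment it blocks. If the outgoing rate stays below the incoming rate, any finite maxSize fills up eventually, so the slow consumer is the thing to fix.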
09-25-2014 16:09:12.905 +0200 INFO Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=6144, current_size_kb=6143, current_size=4790, largest_size=6126, smallest_size=3166
09-25-2014 16:09:12.905 +0200 INFO Metrics - group=queue, name=splunktcpin, blocked=true, max_size_kb=500, current_size_kb=499, current_size=615, largest_size=845, smallest_size=0
09-25-2014 16:09:12.905 +0200 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=644, largest_size=781, smallest_size=0
09-25-2014 16:09:56.924 +0200 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023, current_size=1596, largest_size=1816, smallest_size=637
09-25-2014 16:10:36.903 +0200 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023, current_size=1516, largest_si
I tried increasing the size again, this time to 100 MB. I set the aggqueue to 100 MB (even though the log above still shows 1 MB), but I still get something like: INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=102400, current_size_kb=102300, current_size=1036648, largest_si
Still the same thing: the queues are blocked. What is strange is that this is happening with just this one file, and it was fine before. I'm pretty sure the log volume from this host has not changed.
So how did you eventually resolve this issue?
We created another folder on the Splunk server for these devices to send their logs to, and then set up the monitor on the new folder. That seemed to work.
The problem is that the device is a Juniper router, so it is not possible to install the UF. When I look it up in the SoS app, I can see that the log file has a status of ignored (reading batch file), while all the other files in the monitored folder have a status of reading.
It will ignore files that overload its buffer in an effort to preserve logging for the rest. The server.conf and limits.conf files should be edited at whatever point Splunk touches the data. So if a syslog daemon receives the data and writes it to a file on the indexer, and the indexer watches those local files, then the indexer is where you change it.
I have seen errors like this on Splunk 5 indexers. Mostly, after restarting the services, indexing would continue without any issue. Could it be due to all the network ports getting used up?
Go dig for messages about the actual reason though, preferably at the time indexing stopped for the first time. Those queues are only a symptom.
Alternatively, take a look at http://wiki.splunk.com/Community:HowIndexingWorks and see if queues further down suddenly aren't blocked. Then the processor after the bottom-most blocked queue might be to blame.
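The "bottom-most blocked queue" check can be sketched in Python against pasted metrics.log lines. This is only an illustration: the function names are hypothetical, and the queue order (parsingqueue, aggqueue, typingqueue, indexqueue) is taken as an assumption from the HowIndexingWorks page, written with the lowercase names that appear in the log lines in this thread.

```python
import re

# Assumed pipeline order, top to bottom, using the lowercase names seen in metrics.log.
PIPELINE = ["parsingqueue", "aggqueue", "typingqueue", "indexqueue"]

# Matches key=value pairs such as "name=aggqueue" or "blocked=true".
FIELD = re.compile(r"(\w+)=([^,\s]+)")


def latest_blocked(lines):
    """Map each queue name to the blocked flag from its most recent group=queue line."""
    state = {}
    for line in lines:
        fields = dict(FIELD.findall(line))
        if fields.get("group") == "queue" and "name" in fields:
            state[fields["name"].lower()] = fields.get("blocked") == "true"
    return state


def bottom_most_blocked(lines):
    """Return the last blocked queue in PIPELINE order, or None if none is blocked."""
    state = latest_blocked(lines)
    blocked = [queue for queue in PIPELINE if state.get(queue)]
    return blocked[-1] if blocked else None


# Sample lines copied from the excerpt posted earlier in this thread.
sample = [
    "09-25-2014 16:09:12.905 +0200 INFO Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=6144, current_size_kb=6143, current_size=4790, largest_size=6126, smallest_size=3166",
    "09-25-2014 16:09:12.905 +0200 INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=644, largest_size=781, smallest_size=0",
    "09-25-2014 16:09:56.924 +0200 INFO Metrics - group=queue, name=aggqueue, blocked=true, max_size_kb=1024, current_size_kb=1023, current_size=1596, largest_size=1816, smallest_size=637",
]

print(bottom_most_blocked(sample))  # typingqueue
```

A pasted excerpt may simply be missing lines for a queue, so treat the result as a hint about where to look in splunkd.log rather than as a diagnosis.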
No, disk space is OK. Indexing has stopped only for the clients that use syslog to send data.
The key question is: why did indexing stop? Blocked queues are usually just a symptom of something down the line not working properly; they're usually not the cause of anything.
Disk space would be a common issue... it should be shown prominently in Splunk, though.