I've been searching for quite a while for the cause of an issue I'm experiencing. I'm now at a loss, so I need to reach out and ask a question. Here it is:
Distributed deployment: UFs forward to HFs, which also listen for syslog, before the data is written to an index. Version 5.0.5.
I have 3 boxes. The one with the least traffic correctly indexes all log files all of the time. The two heavy-use boxes behave inconsistently: three log files out of 12 in a directory do not consistently forward to my index. Logs rotate every 30 minutes on all boxes. When looking at splunkd.log I see the following:
"BatchReader - Will retry path="../dns.log" after deferring for 10000ms, initCRC changed after being queued (before=0xfb006157e8ce2fdd, after=0x14244863e6b121b8). File growth rate must be higher than indexing or forwarding rate."
I also get the following:
07-08-2014 13:18:46.211 +0000 INFO BatchReader - Could not send data to output queue (parsingQueue), retrying...
07-08-2014 13:18:49.212 +0000 INFO BatchReader - Continuing...
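To see which files are being deferred and how often, the BatchReader retry messages could be tallied per path. The following is only a rough Python sketch: the function name count_retries and the sample lines are illustrative assumptions, and the pattern is based solely on the message wording quoted above, so it may need adjusting for other versions.

```python
import re
from collections import Counter

# Pattern based on the BatchReader deferral message quoted above (assumption:
# the wording is stable across the log); only the path is captured.
RETRY_RE = re.compile(r'BatchReader - Will retry path="([^"]+)"')

def count_retries(lines):
    """Return a Counter mapping each deferred path to its number of retry messages."""
    counts = Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample lines; in practice, pass an open splunkd.log file object.
sample = [
    'BatchReader - Will retry path="../dns.log" after deferring for 10000ms, initCRC changed after being queued',
    'INFO BatchReader - Could not send data to output queue (parsingQueue), retrying...',
    'BatchReader - Will retry path="../dns.log" after deferring for 10000ms, initCRC changed after being queued',
]
print(count_retries(sample))
```

Comparing these counts against the three files that go missing would show whether the deferrals line up with the gaps in the index.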
Currently, the local limits.conf is set to maxKBps = 131072 on one machine and maxKBps = 0 on the other.
The parsing queue in metrics.log looks like the following:
INFO Metrics - group=queue, name=parsingqueue, max_size_kb=512, current_size_kb=449, current_size=7, largest_size=7, smallest_size=3
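To put that line in perspective, the queue fill level can be worked out from current_size_kb and max_size_kb. Below is a minimal Python sketch, assuming the key=value layout shown in the line above; the function names parse_queue_metrics and fill_ratio are hypothetical helpers, not part of Splunk.

```python
import re

def parse_queue_metrics(line):
    """Parse the key=value pairs of a metrics.log queue line into a dict of strings."""
    return dict(re.findall(r'(\w+)=([^,\s]+)', line))

def fill_ratio(line):
    """Return current_size_kb / max_size_kb for one queue metrics line."""
    fields = parse_queue_metrics(line)
    return int(fields["current_size_kb"]) / int(fields["max_size_kb"])

line = ("INFO Metrics - group=queue, name=parsingqueue, max_size_kb=512, "
        "current_size_kb=449, current_size=7, largest_size=7, smallest_size=3")
print(f"{fill_ratio(line):.1%}")  # 449 / 512 -> 87.7%
```

For the sample line this gives roughly 88% of the 512 KB queue in use, which may be worth tracking over time even when no explicit full-queue message is logged.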
I see no messages about stuck or full queues. "splunk btool" and "splunk list monitor" show that the files are always watched and never dropped, but the data never gets to the indexes. I'm totally stuck on what the issue could be, as I'm not getting full-queue messages in metrics.log but I do see the retry/deferral messages in splunkd.log. However, I thought that even if the reader backs off for 10000ms, the data would eventually get there.
The logs are generating about 5mbps in disk I/O. Splunk is currently generating peaks of 24MB of network traffic, with an average of 11MBPS. CPU usage is low, but memory usage is high (from the app I'm trying to grab logs from).
What should I be looking at for next troubleshooting steps? What could cause the data to be indexed only some of the time?
You'd have to increase your parsing queue size in order to keep up with that rate of KBps. Otherwise the data will stay queued up regardless of how wide you make the pipe: the doorway (the parsing queue) on the forwarders needs to be widened as well.
One step further is increasing the number of pipelines if your event data is huge. This will allow multiple routes for your data, all at a cost of resources: Splunk may use more CPU and add to the memory consumption on your monitored systems.
Ok, so it is not a forwarding speed issue.
If they are Universal or Lightweight Forwarders, this is likely not an indexing limit at their level.
Please check the indexers; they may be overloaded. Use the SOS app on the indexers to check the indexing queues.