I know queue backlog troubleshooting questions are very common but I'm stumped here.
I have 2 Universal Forwarders forwarding locally monitored log files (populated by syslog-ng forwarding) over TCP to 4 load-balanced Heavy Forwarders, which then send the data on to a cluster of 8 indexers. Each Universal Forwarder processes a lot of data, roughly 500 MB per minute, but this setup worked without lag or dropped logs until recently. Disk I/O and network throughput should easily handle this volume.
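For context, the outputs.conf on each UF looks roughly like this (the hostnames and port here are placeholders, not my real values):

# outputs.conf on each Universal Forwarder -- hostnames/port are placeholders
[tcpout]
defaultGroup = hf_group

[tcpout:hf_group]
# Splunk auto-load-balances across the listed Heavy Forwarders
server = hf1.example.com:9997, hf2.example.com:9997, hf3.example.com:9997, hf4.example.com:9997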
However, recently the Universal Forwarders have stopped forwarding logs, and the only WARN/ERROR entries in splunkd.log are as follows:
08-25-2023 14:40:09.249 -0400 WARN TailReader [25677 tailreader0] - Could not send data to output queue (parsingQueue), retrying...
And then, generally a few seconds later:
08-25-2023 14:41:04.250 -0400 INFO TailReader [25677 tailreader0] - ...continuing.
My question is this: assuming there's no bottleneck in the TCP output to the 4 HFs, exactly what "parsing" is being done on these logs that would cause the parsingQueue to fill up? I've looked through just about every UF parsingQueue-related question on Splunk Answers, and I've addressed some common gotchas:
- maxKBps in the [thruput] stanza in limits.conf is set to 0 (unlimited thruput)
- 2 parallel ingestion pipelines, so 2 parsingQueues, each with a 1 GB maxSize (higher than recommended, but I was desperate); my thruput and queue settings are sketched after this list
- no INDEXED_EXTRACTIONS for anything but Splunk's preconfigured internal logs
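Roughly what I have in place for the thruput and queue settings (typed from memory, so treat the exact values as approximate):

# limits.conf on the UF: remove the default forwarder thruput cap
[thruput]
maxKBps = 0

# server.conf on the UF: two ingestion pipelines, each with its own parsingQueue
[general]
parallelIngestionPipelines = 2

# server.conf on the UF: bump the parsingQueue size to 1 GB per pipeline
[queue=parsingQueue]
maxSize = 1GB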
I've taken some other steps as well:
- set up logrotate on the monitored files to rotate any file that gets larger than 1GB, so Splunk isn't monitoring exceptionally large files
- set "DATETIME_CONFIG = NONE" and "SHOULD_LINEMERGE = false" for all non-internal sourcetypes
I don't understand why the parsingQueue would fill up when it doesn't seem like any parsing actions are configured on this machine! Can anyone advise me on what to look for or change to resolve this?
Thanks much.