We were running some load tests over the weekend and ran into an issue where one of our Forwarder nodes went unresponsive. We now attribute it to a large maxQueueSize in outputs.conf combined with all Indexer nodes being unreachable, which led to splunkd consuming all available memory. In the problem case, our maxQueueSize was set to 1000000, and a splunkd process was (in a recorded snapshot) seen consuming 3GB:
maxQueueSize=1000000
8947 root 15 0 3482m 3.1g 7300 S 2.0 39.4 0:35.23 splunkd
To investigate, I restarted splunkd with progressively smaller values for maxQueueSize (10,000; 1,000; and 100) and saw a corresponding reduction in memory consumption:
maxQueueSize=10000
11164 root 15 0 2233m 2.1g 7228 S 0.0 26.8 0:10.03 splunkd
maxQueueSize=1000
11440 root 15 0 394m 292m 7192 S 0.0 3.7 0:05.34 splunkd
maxQueueSize=100
11520 root 15 0 209m 107m 7188 S 0.0 1.4 0:04.50 splunkd
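For reference, the setting lives in outputs.conf on the forwarder. A minimal sketch of the stanza we were adjusting between restarts looks like the following; the output group name and server addresses here are placeholders, not our actual configuration:

# outputs.conf on the forwarder -- illustrative sketch only;
# the group name and indexer addresses are placeholders
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = indexer1.example.com:9997, indexer2.example.com:9997
# value varied between restarts: 1000000, 10000, 1000, 100
maxQueueSize = 1000000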
A few questions:
The one-million value was chosen in an attempt to maximize indexing efficiency, and we are going back to the default. What does varying maxQueueSize actually do for us?
Is it expected that a large maxQueueSize can cause splunkd to consume all memory? Is there any sort of safety shut-off that should kick in?
Thanks,
Yes, that behaviour is expected. maxQueueSize controls the number of events that can be held in memory at any point in time, and increasing it doesn't necessarily mean indexing will work any faster or more efficiently. If the connection between an indexer and a forwarder goes down, the intended behaviour is for the forwarder to fill its queues with data ready to send and then block any further incoming data, whether from files or from a network device. If the value is set too high, that results in high resource consumption in the event of a problem or disconnect.
Generally, if your deployment is performing well, there's no reason to increase this beyond the default, as the queue should never even get as high as 1000. If your forwarder were receiving UDP data, however, and it was imperative to capture as much as possible during an outage, that would be a reason to increase it to a high number. But if data retention were that much of a priority, I would question the suitability of using UDP in the first place.
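In outputs.conf terms, reverting is simply a matter of removing the explicit setting so splunkd falls back to its built-in default (which depends on your Splunk version). A minimal sketch, again using placeholder names rather than your actual stanza:

# outputs.conf -- sketch only; group and server names are placeholders
[tcpout:primary_indexers]
server = indexer1.example.com:9997, indexer2.example.com:9997
# maxQueueSize deliberately left unset so the version default applies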