We were running some load over the weekend and ran into an issue where one of our Forwarder nodes went unresponsive. We now attribute it to a combination of a large maxQueueSize in outputs.conf and all Indexer nodes being unreachable, which led to splunkd consuming all available memory. In the problem case, maxQueueSize was set to 1000000, and a recorded snapshot showed a splunkd process consuming 3GB:
Yes, that behaviour is expected. maxQueueSize controls how many events can be held in memory at any point in time, and increasing it doesn't necessarily make indexing any faster or more efficient. If the connection between a forwarder and an indexer goes down, the intended behaviour is for the forwarder to fill up its queues with data ready to send and then block any further incoming data, whether from files or from a network device. If the value is set too high, the result is high resource consumption whenever a problem or disconnect occurs.
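For reference, this is roughly what the problematic setting would look like. This is only an illustrative sketch: the stanza name and indexer hostnames are made up, and the original post did not share its actual outputs.conf.

```ini
# outputs.conf on the forwarder (stanza name and hosts are examples)
[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997

# The problematic setting: allows up to 1,000,000 events to be
# buffered in memory when the indexers are unreachable, which is
# what drove splunkd's memory usage up during the outage.
maxQueueSize = 1000000
```

Removing the maxQueueSize line (or leaving it at the default) keeps the in-memory buffer bounded, so a prolonged indexer outage causes the forwarder to block inputs rather than grow without limit.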
Generally, if your deployment is performing well, there's no reason to increase this beyond the default; the queue depth should never even approach 1000 events. If you were receiving UDP data on your forwarder, however, and it was imperative to capture as much of it as possible during a disconnect, that would be a reason to set it to a high number. But if data retention is that much of a priority, I would question the suitability of using UDP in the first place.