The problem turned out to be a useACK = true line in an outputs.conf file with no stanza header above it. As a result, it applied to every output, including the [syslog] output. A syslog server will not send back an ACK, so Splunk waits 2 seconds for each event and then sends the event anyway (based on our observations). We added a [tcpout] header above the "useACK = true" setting so that it applies only to tcpout and not to syslog output, and that fixed it.
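For anyone comparing against their own config, here is a rough sketch of what the change looked like. The group names (my_indexers, my_syslog) and the server addresses are made-up placeholders, not our real values, and the layout is simplified from what we actually had.

```ini
# outputs.conf BEFORE (problem): the setting has no stanza header above it,
# so in our case it was treated as applying to every output, syslog included.
#
#   useACK = true
#
#   [tcpout:my_indexers]          <- hypothetical group name
#   server = 10.0.0.1:9997        <- placeholder address
#
#   [syslog:my_syslog]            <- hypothetical group name
#   server = 10.0.0.2:514         <- placeholder address

# outputs.conf AFTER (fix): the setting now sits under [tcpout],
# so it only applies to tcpout and no longer touches the syslog output.
[tcpout]
useACK = true

[tcpout:my_indexers]
server = 10.0.0.1:9997

[syslog:my_syslog]
server = 10.0.0.2:514
```

The only real change is the [tcpout] line above useACK. Everything else stays where it was.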
Early in the troubleshooting, we did hit on this setting. We added useACK = false to the syslog stanza, but apparently that does not disable useACK. I've even explicitly tried again to set it to false under the syslog stanza, and it doesn't seem to matter. If it's set to true globally, the global value seems to take effect.
I still can't explain why maybe 1 out of 10 times we restarted, it would work just fine even though this config error was still present.
Thanks to Jack Herod from Splunk support for finally finding this configuration error. If you're at .conf, I owe you a beer.