Most of our systems use rsyslog for logging, and send their events over UDP to a central Splunk server. This works fine.
One of our groups of users has its own Splunk server, and wanted to log separately, and in more detail, to that server for certain applications of their own. They have set up forwarding over TCP, not UDP, with the following rules in our rsyslog configuration:
:syslogtag, startswith, "psd_" @@psd1d:5140;RSYSLOG_ForwardFormat
:syslogtag, startswith, "psd_" ~
Most of the time, this works fine. Once in a blue moon, however, something very strange happens. The web servers which are creating these log events (the application is a Rails app) suddenly start running very slowly. On investigation, they find that the calls to the logger are taking a very long time. Worse than that, the events are no longer being logged anywhere.
Running strace on rsyslogd shows that it's doing nothing at all. Restarting rsyslogd seems to wake things up and get them going again. I suspect that restarting splunkd would also fix it, since that would also cause the TCP connection to drop, although I haven't tried that yet.
One thing we do know is that it tends to happen to several machines simultaneously, which rather points to a problem at the Splunk end rather than at the rsyslog end.
We've never seen this with UDP forwarding, but the users don't want UDP forwarding because they're scared of losing events, or getting events in the wrong order.
My current theory is that something's causing the TCP connections to get into a wedged state, but I have no idea what. Has anyone seen anything like this before?
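To illustrate the kind of wedge I have in mind, here is a minimal Python sketch. It is not rsyslog's or Splunk's code, just a toy model under the assumption that the receiver keeps the TCP connection open but stops reading from it. The function name demo_stalled_sender and its parameters are made up for the illustration. Once the socket buffers fill, the sender's send() call stops making progress; the sketch uses a timeout to detect that, whereas a sender without a timeout would simply sit there blocked.

```python
import socket
import threading


def demo_stalled_sender(chunk_size=65536, timeout=0.5, max_bytes=1 << 30):
    """Send to a peer that accepts the connection but never reads.

    Returns (stalled, bytes_sent). `stalled` is True if send() timed out,
    i.e. the socket buffers filled up because the peer was not draining them.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # loopback, ephemeral port
    listener.listen(1)
    port = listener.getsockname()[1]

    accepted = []

    def accept_and_ignore():
        # Stand-in for a hung receiver: accept, hold the socket, never recv().
        conn, _ = listener.accept()
        accepted.append(conn)

    acceptor = threading.Thread(target=accept_and_ignore)
    acceptor.start()

    sender = socket.create_connection(("127.0.0.1", port))
    # Without this timeout, send() would block indefinitely once buffers fill.
    sender.settimeout(timeout)

    payload = b"x" * chunk_size
    sent = 0
    stalled = False
    try:
        while sent < max_bytes:
            sent += sender.send(payload)
    except socket.timeout:
        stalled = True
    finally:
        acceptor.join()
        sender.close()
        for conn in accepted:
            conn.close()
        listener.close()
    return stalled, sent


if __name__ == "__main__":
    stalled, sent = demo_stalled_sender()
    print("stalled:", stalled, "bytes sent before stalling:", sent)
```

The number of bytes accepted before the stall depends on the kernel's socket buffer sizes, so it will vary from machine to machine. If something similar is happening here, it would be consistent with strace showing rsyslogd doing nothing, but that is only a guess on my part.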
Thanks,
Tim