Our lightweight forwarders frequently fail to connect to the Splunk server to send tailed log output. Because the monitored logs roll frequently, we end up losing whatever was not forwarded before the roll. The forwarder just logs:
06-21-2010 16:24:36.454 WARN TcpOutputProc - Failed to make a connection, will retry.
06-21-2010 16:24:56.495 INFO TcpOutputProc - Retrying connection to X.X.X.X:7080...
06-21-2010 16:24:56.496 WARN TcpOutputProc - Failed to make a connection, will retry.
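To put a number on how often this happens, the warnings could be pulled out of the forwarder's log and timestamped. This is only a rough sketch: `failure_times` is a name made up for illustration, and it assumes every line starts with a timestamp in the layout shown in the lines above.

```python
from datetime import datetime

# Marker text copied from the forwarder log lines quoted above.
FAILURE_MARKER = "WARN TcpOutputProc - Failed to make a connection"


def failure_times(lines):
    """Return the timestamps of connection-failure warnings (hypothetical helper)."""
    times = []
    for line in lines:
        if FAILURE_MARKER not in line:
            continue
        # Assumption: the first 23 characters are the timestamp,
        # e.g. "06-21-2010 16:24:36.454".
        stamp = line.strip()[:23]
        times.append(datetime.strptime(stamp, "%m-%d-%Y %H:%M:%S.%f"))
    return times
```

Feeding it the lines of the forwarder log gives a list of datetimes that can be bucketed by hour to see when the outages cluster.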
If I restart the Splunk server, it starts receiving data again. I have already scheduled a restart twice a day, but that is not enough.
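Rather than restarting on a fixed schedule, one option would be to probe the receiving port and react only when it stops accepting connections. A minimal sketch, assuming a plain TCP connect is a good enough health signal; `receiver_accepts` is a hypothetical helper name, and the host and port passed to it would be the server address and the 7080 receiving port from the log above.

```python
import socket


def receiver_accepts(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only checks that the port accepts a connection,
    not that Splunk actually indexes what is sent to it.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from cron on one of the forwarder hosts, a False result could trigger an alert or a restart of the server, so the restart happens when the port is actually dead instead of twice a day regardless.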
The Splunk (4.1.3) server runs on Solaris 11, and netstat -iv shows about 1-2 ierrs (input errors) every 10 seconds, although the network team says the switch port is clean. I have tried tuning TCP on the server, but it made no difference.
Also, when it stops receiving data, the server shows a screen full of connections to localhost on the management port in CLOSE_WAIT state, similar to this: