Fresh install of Splunk 4.2 on CentOS. I'm testing with one client but have also tried many. I'm using splunktcp-ssl with light forwarders on Linux and Windows. I am getting data, but since moving to 4.2 I'm seeing these errors every 30 seconds for each client sending cooked data. If I downgrade the server to 4.1.7 I see the same errors from the 4.2 forwarders.
03-17-2011 13:49:30.072 -0700 ERROR TcpInputProc - Error encountered for connection from src=xxx.xxx.xxx.34:37573. Success
03-17-2011 13:50:00.072 -0700 ERROR TcpInputProc - Error encountered for connection from src=xxx.xxx.xxx.34:37666. Success
03-17-2011 13:50:30.072 -0700 ERROR TcpInputProc - Error encountered for connection from src=xxx.xxx.xxx.34:37765. Success
Tweaking the heartbeat setting does not alter the 30 second timing.
Here is some DEBUG:
03-17-2011 12:41:07.871 DEBUG S2S - In doConsume for LengthReadingState
03-17-2011 12:41:07.871 DEBUG TcpChannel - Before accept
03-17-2011 12:41:07.871 DEBUG TcpChannel - Creating polled fd from factory
03-17-2011 12:41:07.871 DEBUG StatusMgr - Updating status for TcpInputProcessor
03-17-2011 12:41:07.871 INFO StatusMgr - destPort=9979, eventType=connect_done, sourceHost=xxx.xxx.xxx.34, sourceIp=xxx.xxx.xxx.34, sourcePort=57370, statusee=TcpInputProcessor
03-17-2011 12:41:07.871 INFO TcpInputConn - Connection in cooked mode from src=xxx.xxx.xxx.34:57370
03-17-2011 12:41:07.872 DEBUG TcpChannel - adding connection to factory created fd = 0xa64e7860
03-17-2011 12:41:07.872 INFO TcpChannel - Accepted connection
03-17-2011 12:41:07.880 DEBUG StatusMgr - Updating status for TcpInputProcessor
03-17-2011 12:41:07.880 INFO StatusMgr - sourcePort=9979, ssl=true, statusee=TcpInputProcessor
03-17-2011 12:41:07.909 ERROR TcpInputProc - Error encountered for connection from src=xxx.xxx.xxx.34:57370. Success
03-17-2011 12:41:07.909 INFO TcpInputConn - src=xxx.xxx.xxx.34:57370 closed connection
03-17-2011 12:41:07.909 DEBUG StatusMgr - Updating status for TcpInputProcessor
03-17-2011 12:41:07.910 INFO StatusMgr - destPort=9979, eventType=connect_close, sourceHost=xxx.xxx.xxx.34, sourceIp=xxx.xxx.xxx.34, sourcePort=57370, statusee=TcpInputProcessor
And my inputs.conf / outputs.conf:
[splunktcp-ssl:9979]

[SSL]
password = $1$+tCc8wYTRIqB
requireClientCert = false
rootCA = $SPLUNK_HOME/etc/auth/cacert.pem
serverCert = $SPLUNK_HOME/etc/auth/server.pem

[tcpout]
defaultGroup = Group1

[tcpout:Group1]
server = xxx.xxx.xxx.101:9979

[tcpout-server://xxx.xxx.xxx.101:9979]
sslCertPath = $SPLUNK_HOME/etc/auth/server.pem
sslPassword = $1$NPPqXQDYcSWN
sslRootCAPath = $SPLUNK_HOME/etc/auth/ca.pem
sslVerifyServerCert = false
This is still an issue. I'm running 5.0.4. It's not as benign as indicated, either: the bug causes TCP connections to open and close, delaying real-time traffic into the indexes and making UFs show up as "missing" in the Deployment Monitor.
We have only one indexer now but may add another later. It only happens on 4.2, and it's both Windows and Linux here. It also happens with the LightForwarder package. As of today I finally have a usable "system". With a couple of "TRANSFORMS-" statements in props.conf I'm no longer getting two copies of every server in Hosts, so now I'm ready to learn how to actually make some use of the data.
That's probably an OK workaround if you're forwarding to only one server. If you have multiple indexers you really want autolb to be enabled.
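For reference, the multi-indexer case is just a comma-separated server list in outputs.conf; a sketch with placeholder indexer names (autoLB and autoLBFrequency are the settings involved):

```ini
[tcpout:Group1]
# Placeholder indexer names -- autoLB rotates the stream between them,
# switching targets every autoLBFrequency seconds (default 30)
server = indexer1.example.com:9979, indexer2.example.com:9979
autoLB = true
```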
I tried for a couple hours tonight to reproduce the problem between a few linux boxes here and have failed completely. However, I know someone else here who sees those messages all the time. I'll have to bug her on Monday. There must be something system or timing dependent about the issue.
Yes, we've also occasionally seen that in-house. I thought I had fixed it already but then it popped up again, so I guess I didn't find all the causes of it.
It's very interesting that it happens every 30 seconds for you. I didn't notice any particular timing when it was happening here. For us it turned up in a stress test where we were bombarding an indexer with lots of data from a few hundred forwarders. Maybe the sheer number of forwarders made it hard for us to see the timing pattern.
The good news is that it seems to be benign -- I haven't seen that message cause any operational problems, and the stress test didn't end up missing any events. I just need to find some time to investigate it further.
Actually, Splunk 4.2 is not required to see this error. We're seeing the same thing now with a 4.1.7 LWF sending autoLB to a pool of three 4.1.7 indexers. This "Error ... Success" message is showing up on all three indexers. But, as the original post stated, we're still getting the data all right. Very strange.
It's the default because you can say server = server1, server2 and then you are load balancing. I think you actually turn off a bunch of the logic that senses servers in a shutting-down state with this change, which means you get slightly worse behavior on server restart. In any event it certainly narrows the problem down: it's the tester socket connections that check whether the server is still accepting incoming connections (isn't shutting down, or backed up).
I could only find two things with 30-second default timings: a heartbeat interval and the load-balance interval. I tried changing the heartbeat interval and nothing changed. This happens with one receiver and one sender, by the way. It happens with 4.1.7 running on the receiver and 4.2 running on the sender. Is there a way to kill any load balancing functions on the senders, since there is no load balancing going on here anyway?
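For a single receiver, one thing to try (a sketch against the outputs.conf posted above; autoLBFrequency defaults to 30 seconds, which lines up suspiciously with the error interval) is turning autoLB off for the group:

```ini
[tcpout:Group1]
server = xxx.xxx.xxx.101:9979
# Assumption: single indexer, so auto load balancing buys nothing here.
# autoLBFrequency defaults to 30 seconds -- the same interval as the errors.
autoLB = false
```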
Error -> Success, to me, suggests a logic bug.
We saw some things like this on Windows where it ended up feeling like an oddity in the TCP stack, so we just suppressed complaining for errno 0.
This is for SSL, so it's possible it may be an oddity in the OpenSSL interface. If you can test non-SSL to see the behavior there, that might be interesting.
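A minimal non-SSL test could pair a plain splunktcp input with a plain tcpout group on a spare port (a sketch; port 9980 and the group name are arbitrary choices):

```ini
# Receiver inputs.conf -- plain (non-SSL) splunktcp on a spare port
[splunktcp:9980]

# Forwarder outputs.conf -- point a test group at the plain port
[tcpout:NonSslTest]
server = xxx.xxx.xxx.101:9980
```

If the "Error ... Success" messages stop on the plain port, that points at the SSL layer.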
Probably we should get a support ticket open. An strace -T -ttt -f -p <splunkd.pid> -o big_output on the receiving splunkd (that's where those errors are appearing, right?) for 15 minutes or so might be helpful.