Deployment Architecture

splunkd.log filling with these errors?

mntbighker
Path Finder

Fresh install of Splunk 4.2 on CentOS. I'm testing with one client but have also tried many. I'm receiving over splunktcp-ssl from light forwarders on Linux and Windows. I am getting data, but since going to 4.2 I'm seeing these errors every 30 seconds for each client sending cooked data. If I downgrade the server to 4.1.7, I see the same errors from the 4.2 forwarders.

03-17-2011 13:49:30.072 -0700 ERROR TcpInputProc - Error encountered for connection from src=xxx.xxx.xxx.34:37573. Success
03-17-2011 13:50:00.072 -0700 ERROR TcpInputProc - Error encountered for connection from src=xxx.xxx.xxx.34:37666. Success
03-17-2011 13:50:30.072 -0700 ERROR TcpInputProc - Error encountered for connection from src=xxx.xxx.xxx.34:37765. Success

Tweaking the heartbeat setting does not alter the 30 second timing.

Here is some DEBUG:

03-17-2011 12:41:07.871 DEBUG S2S - In doConsume for LengthReadingState
03-17-2011 12:41:07.871 DEBUG TcpChannel - Before accept
03-17-2011 12:41:07.871 DEBUG TcpChannel - Creating polled fd from factory
03-17-2011 12:41:07.871 DEBUG StatusMgr - Updating status for TcpInputProcessor
03-17-2011 12:41:07.871 INFO  StatusMgr - destPort=9979, eventType=connect_done, sourceHost=xxx.xxx.xxx.34, sourceIp=xxx.xxx.xxx.34, sourcePort=57370, statusee=TcpInputProcessor
03-17-2011 12:41:07.871 INFO  TcpInputConn - Connection in cooked mode from src=xxx.xxx.xxx.34:57370
03-17-2011 12:41:07.872 DEBUG TcpChannel - adding connection to factory created fd = 0xa64e7860
03-17-2011 12:41:07.872 INFO  TcpChannel - Accepted connection
03-17-2011 12:41:07.880 DEBUG StatusMgr - Updating status for TcpInputProcessor
03-17-2011 12:41:07.880 INFO  StatusMgr - sourcePort=9979, ssl=true, statusee=TcpInputProcessor
03-17-2011 12:41:07.909 ERROR TcpInputProc - Error encountered for connection from src=xxx.xxx.xxx.34:57370. Success
03-17-2011 12:41:07.909 INFO  TcpInputConn - src=xxx.xxx.xxx.34:57370 closed connection
03-17-2011 12:41:07.909 DEBUG StatusMgr - Updating status for TcpInputProcessor
03-17-2011 12:41:07.910 INFO  StatusMgr - destPort=9979, eventType=connect_close, sourceHost=xxx.xxx.xxx.34, sourceIp=xxx.xxx.xxx.34, sourcePort=57370, statusee=TcpInputProcessor

And my inputs.conf and outputs.conf:

[splunktcp-ssl:9979]

[SSL]
password = $1$+tCc8wYTRIqB
requireClientCert = false
rootCA = $SPLUNK_HOME/etc/auth/cacert.pem
serverCert = $SPLUNK_HOME/etc/auth/server.pem


[tcpout]
defaultGroup = Group1

[tcpout:Group1]
server = xxx.xxx.xxx.101:9979

[tcpout-server://xxx.xxx.xxx.101:9979]
sslCertPath = $SPLUNK_HOME/etc/auth/server.pem
sslPassword = $1$NPPqXQDYcSWN
sslRootCAPath = $SPLUNK_HOME/etc/auth/ca.pem
sslVerifyServerCert = false

mntbighker
Path Finder

Add autoLB = false to the senders' outputs.conf, although this is probably still a bug and it should not be necessary.
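For anyone else hitting this, the workaround is a one-line addition to the forwarder's outputs.conf. The group name and address below are taken from the original post; with a single indexer there is nothing to balance anyway:

```ini
# outputs.conf on the forwarder -- workaround sketch; not a good
# long-term setting if you later forward to multiple indexers
[tcpout:Group1]
server = xxx.xxx.xxx.101:9979
autoLB = false
```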

aafogles
Explorer

This is still an issue. I'm running 5.0.4. It's not as benign as indicated, either: the bug causes TCP connections to open and close, delaying real-time traffic into the indexes and causing UFs to show up as "missing" in the Deployment Monitor.


mntbighker
Path Finder

We have only one indexer now but may add another later. It only happens on 4.2, and it's both Windows and Linux here. It also happens with the LightForwarder package. As of today I finally have a usable "system". With a couple of "TRANSFORMS-" statements in props.conf I'm no longer getting two copies of every server in Hosts, so now I'm ready to learn how to actually make some use of the data.
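For reference, the kind of host cleanup described above is usually done with a transform that overrides the host metadata at parse time. The stanza names and regex below are hypothetical, since I didn't share the actual configuration:

```ini
# props.conf -- hypothetical sketch of normalizing host values
[host::myserver*]
TRANSFORMS-fixhost = normalize_host

# transforms.conf
[normalize_host]
DEST_KEY = MetaData:Host
REGEX = .
FORMAT = host::myserver
```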


mitch_1
Splunk Employee

That's probably an OK workaround if you're forwarding to only one server. If you have multiple indexers, you really want autoLB to be enabled.

I tried for a couple of hours tonight to reproduce the problem between a few Linux boxes here and have failed completely. However, I know someone else here who sees those messages all the time. I'll have to bug her on Monday. There must be something system- or timing-dependent about the issue.


mitch_1
Splunk Employee

Yes, we've also occasionally seen that in-house. I thought I had fixed it already but then it popped up again, so I guess I didn't find all the causes of it.

It's very interesting that it happens every 30 seconds for you. I didn't notice any particular timing when it was happening here. For us it turned up in a stress test where we were bombarding an indexer with lots of data from a few hundred forwarders. Maybe the sheer number of forwarders made it hard for us to see the timing pattern.

The good news is that it seems to be benign -- I haven't seen that message cause any operational problems, and the stress test didn't end up missing any events. I just need to find some time to investigate it further.

tmeader
Contributor

Actually, Splunk 4.2 is not required to see this error. We're seeing the same thing now with a 4.1.7 LWF sending autoLB traffic to a pool of three 4.1.7 indexers. The "Error ... Success" message shows up on all three indexers. But, as the original post stated, we're still getting the data all right. Very strange.


jrodman
Splunk Employee

It's the default because you can say server = server1, server2 and then you are balancing. I think you actually turn off a bunch of the logic that senses servers in a shutting-down state with this change, which means you get slightly worse behavior on server restart. In any event, it certainly narrows the problem down: it's the tester socket connections that check whether the server is still accepting incoming connections (isn't shutting down, or backed up).


mntbighker
Path Finder

I added autoLB = false to outputs.conf and the problem disappears. Maybe that should be the default? What percentage of people do load balancing?


mntbighker
Path Finder

I could only find two things with 30-second default timings: a heartbeat interval and the load-balance interval. I tried changing the heartbeat interval and nothing changed. This happens with one receiver and one sender, by the way. It happens with 4.1.7 running on the receiver and 4.2 running on the sender. Is there a way to kill any load-balancing functions on the senders, since there is no load balancing going on here anyway?
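For reference, both of the 30-second defaults I found live in outputs.conf; the values below are the defaults as I understand the 4.x spec file:

```ini
# outputs.conf -- the two settings with 30-second defaults
[tcpout]
heartbeatFrequency = 30   # heartbeat to the receiver (only sent when sendCookedData = true)
autoLBFrequency = 30      # how often auto load-balancing switches to a new indexer
```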

jrodman
Splunk Employee

Error -> Success, to me, suggests a logic bug.

We saw some things like this on Windows where it ended up feeling like an oddity in the TCP stack, so we just suppressed complaining about errno 0.

This is for SSL, so it's possible it may be an oddity in the OpenSSL interface. If you can test non-SSL to see the behavior there, that might be interesting.

We should probably open a support ticket. An strace -T -ttt -f -p <splunkd.pid> -o big_output on the receiving splunkd (that's where those errors are appearing, right?) for 15 minutes or so might be helpful.


mntbighker
Path Finder

I would need some level of assurance the trace was sent to a "dependable" recipient. It contains info considered sensitive by NASA.
