Ever since we added a few more Splunk forwarders to our environment, the Splunk server (search head, indexer, and deployment server on a single Windows box) has stopped accepting connections from the forwarders.
We have around 30 forwarders total, all sending to this one Splunk server.
The Splunk server is now on 4.3.2 and there is no change. Restarting the Splunk server helps for about 2 minutes: the agents reconnect, but end up in a failed state again after a couple of minutes.
Forwarder splunkd.log shows:
06-06-2012 11:27:11.884 -0700 INFO TcpOutputProc - Connected to idx=splunkserver:9997
06-06-2012 11:27:11.885 -0700 INFO TcpOutputProc - Connected to idx=splunkserver:9997
06-06-2012 11:28:03.981 -0700 INFO BatchReader - Removed from queue file='/opt/splunkforwarder/var/log/splunk/metrics.log.2'.
06-06-2012 11:29:41.070 -0700 INFO BatchReader - Removed from queue file='/opt/splunkforwarder/var/log/splunk/metrics.log.5'.
06-06-2012 11:29:55.226 -0700 WARN TcpOutputFd - Connect to splunkserver:9997 failed. Connection refused
06-06-2012 11:29:55.226 -0700 ERROR TcpOutputFd - Connection to host=splunkserver:9997 failed
06-06-2012 11:29:55.226 -0700 WARN TcpOutputFd - Connect to splunkserver:9997 failed. Connection refused
06-06-2012 11:29:55.226 -0700 ERROR TcpOutputFd - Connection to host=splunkserver:9997 failed
06-06-2012 11:29:55.226 -0700 INFO TcpOutputProc - Detected connection to splunkserver:9997 closed
06-06-2012 11:29:55.226 -0700 INFO TcpOutputProc - Detected connection to splunkserver:9997 closed
06-06-2012 11:29:56.553 -0700 WARN TcpOutputFd - Connect to splunkserver:9997 failed. Connection refused
06-06-2012 11:29:56.553 -0700 ERROR TcpOutputFd - Connection to host=splunkserver:9997 failed
06-06-2012 11:29:56.553 -0700 WARN TcpOutputFd - Connect to splunkserver:9997 failed. Connection refused
06-06-2012 11:29:56.553 -0700 ERROR TcpOutputFd - Connection to host=splunkserver:9997 failed
06-06-2012 11:29:56.553 -0700 WARN TcpOutputProc - Applying quarantine to idx=splunkserver:9997 numberOfFailures=2
06-06-2012 11:29:56.553 -0700 WARN TcpOutputProc - Applying quarantine to idx=splunkserver:9997 numberOfFailures=2
06-06-2012 11:30:25.221 -0700 INFO TcpOutputProc - Removing quarantine from idx=splunkserver:9997
The Splunk server's splunkd.log doesn't show much related to the inbound connections. Perhaps a debug flag needs to be set?
Any ideas?
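Regarding the debug flag idea above, my own untested guess (not something confirmed here) is that the receiving side could log more detail if the TcpInputProc category were raised in $SPLUNK_HOME/etc/log.cfg on the Splunk server, for example:
category.TcpInputProc=DEBUG
followed by a restart of splunkd so the logging change takes effect.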
Solution found!
$SPLUNK_HOME/etc/system/local/inputs.conf
[splunktcp://9997]
connection_host = none
Restart the Splunk server and it's fixed. DNS was holding it all up.
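To confirm the setting actually took effect on the receiver, one extra check that may help (my suggestion, not part of the original fix) is to look at the merged configuration with btool and then restart:
$SPLUNK_HOME/bin/splunk btool inputs list splunktcp --debug
$SPLUNK_HOME/bin/splunk restart
The --debug flag shows which file each value comes from, so you can see whether connection_host = none is really being picked up from etc/system/local/inputs.conf.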
Hello Team,
I did the same as you all suggested, but it doesn't work for me:
$SPLUNK_HOME/etc/system/local/inputs.conf
[splunktcp://9997]
connection_host = none
Any other workaround?
Regards,
Steven
Where should these settings go?
On my 2 heavy forwarders, on the cluster master, or on all 10 of my indexers?
@BP9906 @lrudolph @msclimenti
From the documentation, it can be set at various levels in inputs.conf.
I find it easier to set connection_host = ip, since it does not perform a reverse DNS lookup and you still get the IP if the hostname is not provided by the Splunk forwarder (i.e., if it's syslog or something similar).
To answer your question, you would want to review the connection_host setting on any receiving end, which would be your heavy forwarders and indexers.
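As a concrete sketch of that (assuming the receivers listen on the default port 9997; adjust the stanza to whatever port you actually use), the inputs.conf stanza on each heavy forwarder and indexer would look something like:
[splunktcp://9997]
connection_host = ip
This records the sender's IP as the host value without triggering a reverse DNS lookup for every incoming connection.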
On the indexers.
Did you ever find out why DNS resolution became a problem?
Not sure how you figured this out but thanks a ton!!!
I thought I'd also add that telnet splunkserver 9997 shows connection refused.
When I'm on the splunkserver box directly and do telnet localhost 9997, I get the same. netstat -ano reveals it is listening on 9997 and that splunkd.exe owns the PID bound to the port.
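For anyone reproducing this on Windows, the checks boil down to roughly the following (the PID 1234 is just a placeholder for whatever netstat reports on your box; the tasklist step is simply one way to confirm which process owns that PID):
REM confirm something is LISTENING on 9997 and note the owning PID
netstat -ano | findstr :9997
REM confirm that PID belongs to splunkd.exe
tasklist /FI "PID eq 1234"
REM should connect if the listener is actually accepting
telnet localhost 9997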
Yep, that's a "me too". This little gem was causing all kinds of slowness in event delivery and unpredictable connections from the UFs. Adding SSL to the UF-to-HF connection seems to make it even worse. The UFs complained:
Connect to x.x.x.x:9997 failed. No connection could be made because the target machine actively refused it
Connection to host=x.x.x.x:9997 failed
Cooked connection to ip=x.x.x.x:9997 timed out
Thanks ...Laurie:{)
Yeah! This was finally the solution to my problem, too. Our forwarders showed a lot of "WARN TcpOutputProc - Cooked connection to ip=x.x.x.x:9997 timed out" messages in the logs. Eventually we even lost data, despite having two indexers and useACK=true in place. We traced it back to the unconfigured connection_host setting on the indexers, which defaulted to "dns". Since we don't run a DNS server in our network, the number of forwarders we deployed eventually slowed everything down and led to data that couldn't be indexed. connection_host = none solved it all.
Thank you!
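In case it helps someone else hitting the same symptoms: the useACK part mentioned above lives on the forwarder side in outputs.conf. A minimal sketch of that configuration (the group name my_indexers and the server addresses are placeholders, not values from this thread):
[tcpout]
defaultGroup = my_indexers

[tcpout:my_indexers]
server = indexer1.example.com:9997, indexer2.example.com:9997
useACK = true
With useACK = true the forwarder waits for an acknowledgment from the indexer before discarding data from its output queue.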
Where does the connection_host setting live, and in this case was it on the indexer(s) or on the forwarder that you changed it?
This is a setting on the indexers.
Yep, and that window is right after a restart of the Splunk server (i.e., the splunk.exe restart command). After that short window, data stops coming in from all the forwarders.
So is there some window in which telnet splunkserver 9997 does work?
Windows Firewall allows the traffic, especially since the agents do connect right after I restart the Splunk indexer (splunk.exe restart). Within 2-4 minutes of the indexer restart they disconnect and connections are refused; then, after about 5 minutes, the Splunk server starts accepting the TCP connections again, but no data is received by the indexer.
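For anyone else ruling out Windows Firewall, the inbound rule looks roughly like this (the rule name "Splunk receiving 9997" is just a placeholder for whatever the rule is called locally):
netsh advfirewall firewall show rule name="Splunk receiving 9997"
REM or, to create the inbound allow rule for the receiving port in the first place:
netsh advfirewall firewall add rule name="Splunk receiving 9997" dir=in action=allow protocol=TCP localport=9997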
I have the same issue. What was your resolution? I'm on 6.1.5 now.
Firewalls in play?