Hi all, after a recent upgrade to the latest Splunk version (6.6.1), we can see some of the Splunk agents throwing errors in splunkd.log. When we ran the query below, we found that almost 170 agent nodes are throwing the same error. The host mentioned in the log_level=ERROR event refers to the Deployment Server.
Query Details:
index=_internal log_level=ERROR "Connection to host=168.x.x.x:9997 failed" | dedup host | table host
Among the 170 agents, some are running an older version that we plan to upgrade soon, but the rest are on version 6.2, and we are able to see data when searching against those hosts. We still need to know what is causing this issue. Kindly guide me on how to troubleshoot it, and let me know if there is any other query that can be used to find the exact cause.
Good day.
Are you saying the Deployment Server cannot connect to the Indexer(s)? The port in your example above, 9997, is the de facto port for outbound communication from a forwarder to the Indexers. In that case, I would check the network paths between the Deployment Server and the Indexers. I would also check whether the "Connection to host" list represents some or all of your Indexers. A telnet session from the DS to the Indexer listener port (9997) is an easy check.
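To see which destinations the errors point at, and how many forwarders are affected by each, a rough search along these lines should work (the extracted field names dest_host and dest_port are my own, and this assumes the errors land in the standard splunkd sourcetype in _internal):
index=_internal sourcetype=splunkd log_level=ERROR "Connection to host="
| rex "Connection to host=(?<dest_host>[^:]+):(?<dest_port>\d+)"
| stats dc(host) AS forwarders_affected values(dest_port) AS ports by dest_host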
Are you saying that Forwarders (agents) cannot connect to the Deployment Server? A Deployment Server accepts connections from Forwarders on the Management Port (8089). In that case, I would check the network paths between the Forwarders and the Deployment Server. The network should allow inbound connections to the DS on port 8089. Also review the deploymentclient.conf on your Forwarders and confirm that the correct IP and port are being used; see the example stanza below. The Forwarder Management UI on the Deployment Server will display the host names of any Forwarders that have checked in.
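For reference, a minimal deploymentclient.conf sketch looks like this; the host name and port are placeholders and should match your actual Deployment Server:
[target-broker:deploymentServer]
targetUri = DPinstance.com:8089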
Hi ekost, thanks for your effort on this. Yes, I checked all the parameters you mentioned in your comments on one of the remote agent nodes and found everything to be correct with the configurations and ports.
Details :
hostname=Rmnode01 -- Remote agent node
hostname=DPinstance -- Deployment server
hostname=indexer -- Indexer server
Telnet Details:
Telnet connectivity checked from the remote node to the Deployment Server - it's connecting
[root@Rmnode01 default]# telnet DPinstance.com 8089 -- Remote Agent node
Trying 168.x.x.x...
Connected to DPinstance.com (168.x.x.x).
Escape character is '^]'.
Telnet connectivity checked from the Deployment Server to the remote node - it's connecting
[root@DPinstance ~]# telnet Rmnode01.xxx.com 8089 -- Deployment server
Trying 10.x.x.x...
Connected to Rmnode01.xxx.com.
Escape character is '^]'.
Telnet connectivity checked from the remote node to the Indexer instance -- it's connecting
[root@Rmnode01 default]# telnet indexer.xxx.com 9997 -- Indexer instance
Trying 168.x.x.x...
Connected to indexer.xxx.com (168.x.x.x).
Escape character is '^]'.
[root@Rmnode01 ~]# cd /opt/splunkforwarder/bin
[root@Rmnode01 bin]# ./splunk version
Splunk Universal Forwarder 6.2.0 (build 237341)
However, on the Deployment Server we noticed that the open files limit (ulimit -n) was set to 1024, which is very low compared to the Splunk recommendation, so we have been advised to increase the parameter to a minimum of 81942, or more, based on our environment setup. I hope this fixes the issue; I will provide an update after changing the parameter in /etc/security/limits.conf (the lines we plan to add are sketched below).
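For reference, the planned /etc/security/limits.conf change looks roughly like this; the splunk user name is an assumption for our environment, and the new limit only applies to splunkd once it is restarted from a fresh login session:
splunk soft nofile 81942
splunk hard nofile 81942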
Good day. From your comments above, I'm assuming the issue is determining why some forwarders occasionally will not connect to the indexers. There are many potential reasons why an indexer might be unreachable for a period of time. If the indexer's processing queues are blocked, it can stop accepting new forwarder (agent) connections until the queues clear. The Monitoring Console is used to monitor the queues on the indexers; here's a link to the configuration docs for the Console. Queues can become blocked for many reasons, including invalid or poor-quality props/transforms, slow storage IOPS, and so on. On the indexers, make sure Transparent Huge Pages is disabled, check the ulimits, and test the storage (FIO or Bonnie++) to validate that the indexers have the best chance of performing consistently. A quick search for blocked queues is sketched below.
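Assuming the default _internal metrics logging is intact, a search like this should show which indexer queues are reporting themselves as blocked, and how often:
index=_internal source=*metrics.log* sourcetype=splunkd group=queue blocked=true
| stats count by host, name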
Another common scenario is one forwarder getting locked to one indexer for long periods of time. This is usually due to a single, large data source coming from that forwarder that doesn't line break. As a result, one indexer becomes the sole recipient of that data source (no load distribution), and that indexer will not be available to accept data from other forwarders. A rough outputs.conf sketch for mitigating this is below.
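If that turns out to be the case, one possible mitigation on the forwarder side is to force time-based load balancing in outputs.conf so a long-running stream still rotates across indexers. This is only a sketch; the output group name and server value are placeholders based on the example hosts above, and forcing the switch can split events mid-stream on sources without clean line breaking, so test it carefully first:
[tcpout:primary_indexers]
server = indexer.xxx.com:9997
forceTimebasedAutoLB = true
autoLBFrequency = 30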