Hi there,
Since we rolled out a couple of hundred forwarders, we have been seeing connection errors.
If I do a telnet from a forwarder (Unix), sometimes I get an answer and sometimes I don't. When it works, we receive events.
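For reference, the kind of connectivity check I mean looks roughly like this (the indexer name below is a placeholder; 9997 is the receiving port):

    # run on a forwarder host; the indexer hostname is a placeholder
    telnet splunk-indexer.example.com 9997
    # or, where telnet is not installed:
    nc -vz splunk-indexer.example.com 9997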
On the indexer I can see this error event:
ERROR TcpInputProc - Error encountered for connection from ... timeout
I have a lot of them...
The forwarders and the indexer are in the same subnet. We already installed a new indexer to verify whether the issue is in our configuration; with the new indexer we see the same problem.
On the forwarder side we see the following warning message:
TcpOutputProc - Raw connection to ip ... :9997 timed out
Has anyone had the same issue?
Thanks in advance.
Regards.
This error is caused by the heartbeat function. Every 30 seconds the forwarder sends a heartbeat to the indexer; if the indexer doesn't receive it within that time, it writes a log entry with the timeout message. Network devices such as firewalls, or long remote connections, can cause this. I disabled the heartbeat. Another option could be to change the frequency from the default of 30 seconds...
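If you want to try the second option, the interval can be raised on the forwarder in outputs.conf. A minimal sketch, assuming a single target group (the group name, server and value below are examples only):

    # $SPLUNK_HOME/etc/system/local/outputs.conf on the forwarder
    [tcpout:primary_indexers]
    server = splunk-indexer.example.com:9997
    # raise the heartbeat interval from the default of 30 seconds (example value)
    heartbeatFrequency = 120

Restart the forwarder after the change.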
Are all of your Splunk servers using NTP, and do you have the correct DNS records loaded? Timing and authentication problems can cause issues in your Splunk infrastructure.
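A quick way to spot-check both on a Unix host (the host name and IP below are placeholders):

    # is the clock synchronised against an NTP peer?
    ntpq -p
    # do forward and reverse lookups for the indexer resolve correctly?
    nslookup splunk-indexer.example.com
    nslookup 192.0.2.10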
It could be that you are overloading your network and/or indexer.
Did the problem always exist, or did it start occurring once you reached a certain number of forwarders sending data?
Have you installed the Deployment Monitor app? It ships with Splunk by default; you just need to enable it. It can give you some insight into congestion problems.
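One way to enable it is via Manager » Apps in Splunk Web; alternatively, a sketch of the config approach (assuming the bundled app directory is named SplunkDeploymentMonitor):

    # $SPLUNK_HOME/etc/apps/SplunkDeploymentMonitor/local/app.conf
    [install]
    state = enabled

Restart Splunk afterwards.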
Please tell us more about your HW/SW configuration (OS, Splunk version, etc.).
UPDATE:
Does the error occur for a particular type of forwarder?
Are your ulimit and other OS settings (forwarder and indexer) the same as in the other (functioning) landscape? (See the quick check after this list.)
Are there intermediate network components that might be causing trouble (switches, routers, firewalls)?
Does the problem go away when you have lower loads (e.g. at night)?
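A quick way to compare those OS limits on a forwarder or indexer host (Linux; the /proc check assumes splunkd is running):

    # limits for the shell/user that starts Splunk
    ulimit -n     # max open file descriptors
    ulimit -u     # max user processes
    # limits actually applied to the running splunkd process (Linux only)
    cat /proc/$(pgrep -o splunkd)/limits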
/K
Hi there, the error seems to have disappeared when I moved from a Universal Forwarder configuration to a Light Forwarder.
I was not able to get data into my indexers, but I'm not sure if this error had anything to do with it.
The error was appearing with as few as 4 hosts, so I don't think it's related to a network load issue.
Just for your information: at the moment this looks like normal behaviour. We think these "error" messages don't influence Splunk's indexing behaviour.
@JasonCzerak: Did you find a solution, or any hints about this?
I have the same problem. The forwarders are on the same subnet as the intermediate forwarder. With as few as 10 connections to it, it would error out.
Also, to reply to this thread all you need to do is click "comment on this answer" below this message; it saves me converting your answers to comments 😉
There are several tools for this, depending on your OS; common ones include Wireshark and tcpdump. /k
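For example, a capture limited to the receiving port on the indexer could look like this (the interface name is a placeholder and 9997 is assumed to be the receiving port):

    # run as root on the indexer; eth0 is a placeholder for the capture interface
    tcpdump -i eth0 -s 0 -w /tmp/splunk_9997.pcap port 9997

The resulting file can then be opened in Wireshark to look for incomplete handshakes or resets.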
Thanks. How does it work, the packet capture on an indexer?
This sounds a lot like a firewalling or stateful-inspection issue. Do you have a firewall between the machines? I've seen this before where a firewall decided either that it wouldn't allow a TCP connection, that it should time one out too quickly, or that the connection had been open too long. Perhaps it would be useful to do a packet capture on the indexer?
Thanks for the response.
Yes, with fewer connections we don't have trouble; it only came up once we added more forwarder connections.
We installed a new indexer on different hardware (other switch ports, other layer-3 components).
We don't have this issue on all forwarders, just a couple of hundred that are located in different subnets.
It is definitely an issue with the three-way handshake (it doesn't always complete successfully), i.e. the TCP connection between forwarder and indexer doesn't always come up properly. All firewall logs have been checked; no noticeable events.
We opened a Splunk support ticket today.
Updated my answer above with further questions. /k
Thanks for your response. The indexer is not overloaded.
Before we rolled out the new forwarders (~1000), we had a couple of hundred without these errors.
All queues are fine.
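For reference, queue fill can be checked with a search along these lines (a rough sketch against the internal metrics.log data):

    index=_internal source=*metrics.log* group=queue
    | eval fill_pct=round(current_size_kb / max_size_kb * 100, 2)
    | timechart max(fill_pct) by name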
We already checked S.o.S and the Deployment Monitor without finding any helpful message. The only message I get is the one I pasted above.
The indexer is a powerful quad-core machine with 16 GB of RAM. The indexes are located on a NetApp. The Splunk version is 4.3.1, on both the indexer and the forwarders.
We already tried the same scenario with 4.3; same behaviour.
At the moment the network team is checking every point.
We have exactly the same configuration (HW/SW) in other landscapes without any problems, and in some of those landscapes we have a lot more forwarders.
