TcpOutputProc - Cooked connection to ip=x.x.x.x:9997 timed out

Path Finder

I have seen several threads opened with this issue, but nothing that fits the situation we are facing.

This happens when our server farms are busy, and the affected servers are all behind an F5. Every server behind the F5 times out during peak load. My theory is that all of our sessions are being used up and Splunk simply can't send during these periods. Has anyone else experienced something similar?

Thanks!

Re: TcpOutputProc - Cooked connection to ip=x.x.x.x:9997 timed out

Builder

A few questions:
Are your servers using the F5 as their gateway back to the indexer?

Have you looked at the load, connection count, bandwidth, and other stats on the F5 during peaks to see whether anything plateaus? Have you also checked the F5 logs for anything unusual?

Have you checked the load on the indexer during peak times? Perhaps it is underpowered for the input load.

What about other indexing activity on the indexer during these times? Are you indexing from any other machines, and do those remain unaffected while the issue is occurring? (If you have no other inputs, you could search the internal indexes for activity during these times.)

This information might help the community provide assistance.

Re: TcpOutputProc - Cooked connection to ip=x.x.x.x:9997 timed out

Path Finder

Are your servers using the f5 as their gateway back to the indexer?
Yes.

Have you looked at the load, connection count, bandwidth and other stats on the F5 during peaks to see if there is some kind of plateau? Also have you checked the logs on the f5 for anything unusual?
I do not have access to the F5, but I have looked over the shoulder of someone who does. He was saying that they are reaching the limit of threads to the server farms.

How about other indexing functionality on the indexer during these times, are you indexing from any other machines and is this not impacted during the issue? (you could search the internal indexes for activity during these times if you have no other inputs)
Yes. We are getting logs from everything except the servers behind the F5. I know. Pretty obvious.

Re: TcpOutputProc - Cooked connection to ip=x.x.x.x:9997 timed out

Builder

It looks like your universal forwarders are trying to open connections to send data to your indexer, and those TCP connections are timing out because of some threshold on your gateway device. You could rule out anything specific to Splunk by trying other outbound TCP connections from the servers that traverse the same gateway device during peak load.
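As a rough way to run that kind of check, here is a minimal Python sketch that attempts a plain TCP connection and reports whether it succeeded and how long it took. The function name `probe_tcp` and the 5-second timeout are my own choices for illustration, not anything from Splunk or the F5; the host and port you pass in would be whatever sits on the far side of the gateway.

```python
import socket
import time


def probe_tcp(host, port, timeout=5.0):
    """Try one TCP connection and return (ok, seconds_elapsed, error_text).

    `timeout` is an arbitrary illustrative value, not a Splunk setting.
    """
    start = time.monotonic()
    try:
        # create_connection performs the full TCP handshake, then we close.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # Covers timeouts, refused connections, and name resolution errors.
        return False, time.monotonic() - start, f"{type(exc).__name__}: {exc}"
```

Calling `probe_tcp("<indexer address>", 9997)` and the same function against some non-Splunk service behind the same gateway, both during and outside peak times, should show whether the timeouts are specific to the forwarder traffic or affect any outbound connection.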

The comment about "reaching the limit of threads to the server farms" is ambiguous to me in F5 terms, so I would recommend collecting more detailed stats on how many connections are in the various connection tables on the F5 when this is happening, along with its load statistics.

If the UF can't connect to the indexer, the default or configured queuing behavior takes effect on the forwarder, and depending on what is configured you may or may not be protected from data loss. See here for more on that: http://docs.splunk.com/Documentation/Splunk/6.2.1/Forwarding/Protectagainstlossofin-flightdata

The behavior I observe on my universal forwarders is that TCP connections for forwarding data are not held open indefinitely, so when looking at netstat on the forwarder you will normally see one in ESTABLISHED state and a few in TIME_WAIT. In my experience a few such connections are expected under load and even when idle.

You might want to look at netstat on the forwarders and indexers during this time to get a rough idea of how many TCP connections on port 9997 are in your Splunk systems' connection tables. I would expect that during the problems with your gateway device (the F5) you would see a connection or two in SYN_SENT state, waiting for the SYN-ACK from the server. That would also confirm that there is a connectivity issue.
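To make that tally easier to repeat during a peak, here is a small Python sketch that counts connection states for one port from netstat output. It assumes the Linux `netstat -ant` column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); other platforms print different columns and would need a different parser. The function name `count_states` is just an illustrative choice.

```python
from collections import Counter


def count_states(netstat_output, port=9997):
    """Count TCP states for connections whose local or remote port matches.

    Assumes Linux `netstat -ant` style lines with six columns:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    counts = Counter()
    wanted = str(port)
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP row with a state.
        if len(fields) < 6 or not fields[0].lower().startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        # The port is whatever follows the last colon in each address.
        local_port = local.rsplit(":", 1)[-1]
        remote_port = remote.rsplit(":", 1)[-1]
        if wanted in (local_port, remote_port):
            counts[state] += 1
    return counts
```

One way to use it is to pass in the text from `subprocess.run(["netstat", "-ant"], capture_output=True, text=True).stdout` every few seconds and log the result. A count of SYN_SENT entries that grows on the forwarders during the peak would point at connections that never complete the handshake.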

If you are hitting a limitation on the F5 that cannot be overcome, you might want to consider an alternate route, if one is available. In our environment we use SNAT on the F5s so that only the load-balanced application traffic flows through the F5 and the rest flows through a firewall.