
UniversalForwarder fails to connect with "No connection could be made because the target machine actively refused it"

naydenk
Path Finder

Hello

I have multiple client UFs sending to an intermediary UF, which then forwards to an indexer. Sporadically, the client UFs will log the messages below:

10-02-2012 07:31:58.403 -0400 WARN  TcpOutputProc - Cooked connection to ip=10.x.y.60:9997 timed out
10-02-2012 07:32:28.389 -0400 WARN  TcpOutputProc - Cooked connection to ip=10.x.y.60:9997 timed out
10-02-2012 07:32:39.389 -0400 WARN  TcpOutputFd - Connect to 10.x.y.60:9997 failed. No connection could be made because the target machine actively refused it.
10-02-2012 07:32:39.389 -0400 ERROR TcpOutputFd - Connection to host=10.x.y.60:9997 failed
10-02-2012 07:32:39.389 -0400 WARN  TcpOutputProc - Applying quarantine to idx=10.x.y.60:9997 numberOfFailures=2

Also, when I check the port, the intermediary UF is indeed NOT listening (as seen from the client UF side):

c:\Program Files\SplunkUniversalForwarder\bin>portqry -n 10.x.y.60 -e 9997 -i
 Querying target system called:
  10.x.y.60
TCP port 9997 (unknown service): NOT LISTENING

A couple of seconds later, the same command will show it LISTENING. In addition, the client UF and intermediary UF are on the same LAN, so latency between them is <1ms.

At the same time, the intermediary UF logs this:

10-02-2012 07:32:26.235 -0400 WARN  PipelineInputChannel - channel "source::C:\Logs\BI-msg.txt20120915|host::MYHOST2|BI-msg-2|remoteport::64024" ended without a done-key
10-02-2012 07:32:33.735 -0400 WARN  PipelineInputChannel - channel "source::C:\Logs\AppLog.txt|host::MYHOST1|AppLog|remoteport::62340" ended without a done-key
10-02-2012 07:32:33.735 -0400 WARN  PipelineInputChannel - channel "source::C:\Logs\AppLog.txt.4|host::MYHOST1|AppLog|remoteport::62340" ended without a done-key
10-02-2012 07:32:51.736 -0400 WARN  PipelineInputChannel - channel "source::C:\Logs\AppLog.txt|host::MYHOST1|AppLog|remoteport::62339" ended without a done-key

and splunkd on the intermediary has not crashed. The metrics.log file on the intermediary UF does not show blocked queues either.

Some of the data that Splunk is supposed to index on the clients never makes it to the indexer, I am guessing because of all this flapping. Any ideas what is causing this?

bmacias84
Champion

OK, this is my understanding of your problem: sporadically the CUFs (client Universal Forwarders) have communication issues with the IUF (intermediary Universal Forwarder). The common denominator is your IUF. It is possible that your IUF is unable to service those connections, or is unable to respond in a timely fashion. Here are my theories.

  1. Port exhaustion (see the command sketch after this list). If memory serves, Windows can in theory track over 16 million concurrent TCP connections, but other parameters stop you long before that. The first is MaxUserPort: with the legacy dynamic range of 1025-5000 you get roughly 3,976 ephemeral ports for outbound connections by default (active + TIME_WAIT = concurrent connections). When a client connects out, the socket is implicitly bound to one of those dynamic ports. Next, the TCP Control Block (TCB) pool limits how many connections TCP can support simultaneously (legacy default 2,000), and the TCB hash table is what lets Windows find an active connection quickly; if it is too small, every lookup costs extra CPU time (legacy default 512). Even allocating 1,000 TCBs does not mean you will get 1,000 active connections: if an application is rapidly opening and closing connections, each closed connection sits in TIME_WAIT and holds its port for about four minutes. Windows will also sometimes kick in SYN attack protection (SynAttackProtect) when certain thresholds are exceeded (I have run into this personally). On Windows Server 2008 R2 SP1 you will want to verify what the actual defaults are.

  2. The VM not getting scheduled quickly enough on the physical hardware. VMs run in apparent time, not real time, and are handed time slices to execute on the physical CPUs. With 4 vCPUs, the VM has to wait until 4 physical cores are free at the same time before it can run, so giving a VM more vCPUs can actually hurt performance. You can verify this by checking the CPU ready time, which is the time the VM spends waiting to be scheduled; high values may indicate a problem.
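
A quick sketch of how I would start checking theory 1 from a cmd prompt on the IUF. The registry values below are the legacy tuning knobs; on 2008 R2 the dynamic range is managed through netsh, so these queries mostly tell you whether someone has overridden the defaults:

rem Show the ephemeral (dynamic) port range currently in effect
netsh int ipv4 show dynamicport tcp

rem Check whether the legacy tuning values are set (an error just means the OS default applies)
reg query HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v MaxUserPort
reg query HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v TcpTimedWaitDelay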

Hope this helps or gets you started with some ideas. Cheers

0 Karma

bmacias84
Champion

Here is a good article on CPU ready stats: http://www.boche.net/blog/index.php/2010/10/21/cpu-ready-to-rdy-conversion/. It is also possible that your output queue is maxed out; that is controlled by the maxQueueSize setting in outputs.conf. You might also try increasing the logging level for the output channels. Do you have indexer acknowledgement turned on?
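
Something roughly like this in outputs.conf on the CUFs is what I have in mind (the group name and queue size are just examples, not recommendations):

[tcpout:intermediate_forwarders]
server = 10.x.y.60:9997
# a larger output queue gives the forwarder more buffer while the IUF is briefly unreachable
maxQueueSize = 7MB
# indexer acknowledgement: the forwarder re-sends anything the receiving tier never confirmed
useACK = true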

0 Karma

naydenk
Path Finder

OK, I will try typeperf and post back. Thanks!

0 Karma

naydenk
Path Finder

One last thing: when the CUF connections drop from ESTABLISHED to CLOSE_WAIT, splunkd.log on the client has these:

11-09-2012 16:09:59.416 -0500 INFO TailingProcessor - Could not send data to output queue (parsingQueue), retrying...
11-09-2012 16:11:09.547 -0500 INFO TailingProcessor - ...continuing.
11-09-2012 16:11:19.565 -0500 INFO TailingProcessor - Could not send data to output queue (parsingQueue), retrying...

I will keep digging and post back any resolution.
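
As a sanity check I will also grep the client's metrics.log for blocked queues (assuming the default install path, and that the UF logs the usual group=queue lines):

findstr /c:"blocked=true" "C:\Program Files\SplunkUniversalForwarder\var\log\splunk\metrics.log"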

0 Karma

naydenk
Path Finder

vCPUs - the Ready counter was at ~150 ms with 4 vCPUs. I dropped the VM to 2 vCPUs, which brought it down to ~50 ms (so roughly a 25 ms improvement, I guess). Then I isolated the box onto its own host - a beefy one (2 physical 12-core CPUs, 96 GB RAM). As the single guest on that host, the Ready counter dropped to ~30 ms. Not sure if that is a good value. The other IUF (handling 600 CUFs) shows a Ready value of ~45 ms with 4 vCPUs, so the new number is definitely a little better, but I don't think that would explain it.
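
For reference, converting those summation values to a percentage using the 20-second (20,000 ms) real-time sample interval described in the article bmacias84 linked:

ready % = (ready summation in ms / 20,000 ms) * 100
150 ms -> ~0.75%,  50 ms -> ~0.25%,  30 ms -> ~0.15%

All of these sit far below the commonly cited ~5% per-vCPU warning level, which is why I doubt CPU ready is the bottleneck.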

I am still leaning toward something with either the splunkd configuration or TCP tuning needed because of the WAN latency.

0 Karma

bmacias84
Champion

@naydenk, I would grab the TCP performance stats using typeperf.exe against both IUFs, then compare the TCP stats over a 24-hour period. You might see a problem with half-open pipes.
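
Something like this collects one sample per minute for 24 hours into a CSV (double-check the counter names against what your Windows version exposes):

typeperf "\TCPv4\Connections Established" "\TCPv4\Connection Failures" "\TCPv4\Connections Reset" "\TCPv4\Segments Retransmitted/sec" -si 60 -sc 1440 -f CSV -o iuf_tcp_stats.csv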

0 Karma

naydenk
Path Finder

Thanks bmacias84, these are definitely helpful suggestions. Here is what I have done so far:

Port exhaustion - I don't think this is the case: at any given time I have fewer than 1,000 ports in use across the ESTABLISHED, CLOSE_WAIT and TIME_WAIT states. This server is supposed to take ~300 clients, which fail over to the remote IUF, and that one handles the full ~600 without issues (and, as I mentioned, both IUFs are built identically). I will have to look into your TCB/SYN-attack suggestions a little further.
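
For reference, this is roughly how I am counting them from a cmd prompt (the :9997 filter is just the Splunk receiving port):

netstat -ano -p tcp | find /c "ESTABLISHED"
netstat -ano -p tcp | find /c "TIME_WAIT"
netstat -ano -p tcp | find /c "CLOSE_WAIT"
netstat -ano -p tcp | find ":9997" | find /c "ESTABLISHED"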

I will continue in the next comment...

0 Karma

naydenk
Path Finder

Yes, it is a VM, and it is Windows 2008 R2 Enterprise SP1 with 8 GB RAM and 4 vCPUs. It is the exact same configuration as the IUF in the datacenter closer to the indexer, by the way (and that IUF is not showing issues).

The CUFs are set up to load-balance between the two IUFs (initially I had them pointing only to the closer one). The total count between the two datacenters is approximately 500, split evenly.
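
For reference, the load-balancing part of the CUF outputs.conf looks roughly like this (group name and hostnames are placeholders):

[tcpout]
defaultGroup = intermediate_forwarders

[tcpout:intermediate_forwarders]
# the forwarder switches between the two IUFs on the autoLB interval (30s by default)
server = iuf-dc1.example.com:9997, iuf-dc2.example.com:9997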

0 Karma

bmacias84
Champion

@naydenk, a couple of questions: Is your intermediary UF a VM? Is it Windows, and if so, which version and SP? How many UFs are connecting to your intermediary UFs? Does the UF server run any other applications? How many intermediary UFs do you have, and are the UFs load-balanced across them?

0 Karma

naydenk
Path Finder

Nothing has been changed on the intermediary UFs, all default settings.

The IUF that is having the issue is in a remote datacenter, ~70 ms away from the indexer. It did not have this issue when only a few (<50) client UFs were forwarding to it. It could be the latency, because the IUF that is in the same datacenter as the indexer does not exhibit this behavior. But there has to be a way around this - I can't imagine I am the only person forwarding Splunk data across a WAN...

0 Karma

sowings
Splunk Employee

Sounds almost like an issue with the network stack. Have you done any tuning of TCP parameters?
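
If you haven't, a quick way to see what the stack is currently set to on a 2008 R2 box is:

netsh int tcp show global
netsh int tcp show heuristics

Receive-window auto-tuning is the setting that tends to matter most on a ~70 ms link.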

0 Karma

hagjos43
Contributor

Ever get a response to this or figure it out? We are having a similar issue.

0 Karma

naydenk
Path Finder

I have a case open with Splunk Support. They are trying to figure it out. I will post my findings and their response as soon as I get anything good.

0 Karma