Deployment Architecture

Splunk shutdown procedure delayed by TcpInputProc

tomasztomasz
Loves-to-Learn

We have noticed that during the HFW shutdown procedure (e.g. triggered by a Parsing app deployment) there is a sequence of events that appears to close the active incoming TCP connections. An example follows:

 

09-08-2020 15:12:38.606 +0100 INFO  TcpInputProc - Running shutdown level 1. Closing listening ports.
09-08-2020 15:12:38.606 +0100 INFO  TcpInputProc - Done setting shutdown in progress signal.
09-08-2020 15:12:38.606 +0100 INFO  TcpInputProc - Shutting down listening ports
09-08-2020 15:12:38.606 +0100 INFO  TcpInputProc - Stopping IPv4 port 9997
09-08-2020 15:12:38.606 +0100 INFO  TcpInputProc - Setting up input quiesce timeout for : 90.000 secs
09-08-2020 15:12:38.942 +0100 INFO  TcpInputProc - Waiting for connection from src=172.18.18.185:64536, 172.30.194.1:58219, 172.16.57.76:49451, 172.16.218.6:52112, 172.30.50.1:34143, 172.16.36.20:50702, 172.18.13.28:39612, 172.30.66.2:47563, 172.16.57.79:54330, 172.16.165.70:57168 ...  to close before shutting down TcpInputProcessor.
...
09-08-2020 15:14:19.103 +0100 WARN  TcpInputProc - Could not process data received from network. Aborting due to shutdown
09-08-2020 15:14:20.123 +0100 WARN  TcpInputProc - Could not process data received from network. Aborting due to shutdown
09-08-2020 15:14:21.138 +0100 WARN  TcpInputProc - Could not process data received from network. Aborting due to shutdown
09-08-2020 15:14:22.172 +0100 WARN  TcpInputProc - Could not process data received from network. Aborting due to shutdown
Now, what worries me is that the number of "TcpInputProc - Could not process data received from network. Aborting due to shutdown" events can vary from around 20 (which in total takes 15-20 seconds) to hundreds (which can take as long as 4 minutes). The more of those events there are, the longer the shutdown procedure takes and the longer the HFW remains inactive (unable to process data). Eventually, the shutdown is forced after the default 360 seconds.
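For anyone wanting to quantify this per restart, a minimal sketch that counts the abort events in splunkd.log (the helper name is mine, and the log path in the example is the default install location, so adjust for your environment):

```shell
# Count the "Aborting due to shutdown" WARN events in a splunkd.log (or an excerpt).
count_shutdown_aborts() {
  # $1 = path to splunkd.log
  grep -c 'Aborting due to shutdown' "$1"
}

# Example (default path, adjust as needed):
# count_shutdown_aborts /opt/splunk/var/log/splunk/splunkd.log
```

Comparing this count across restarts, together with the timestamps of the first and last abort event, gives a rough measure of how long each quiesce phase lasted.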

Questions:
1. Why do we see a different number of "TcpInputProc - Could not process data received from network. Aborting due to shutdown" events on different occasions?
2. Is there any way to limit them and generally speed up the shutdown procedure? Maybe there is some tuning we can do on the HFW nodes?


isoutamo
SplunkTrust

The number of those events depends on how much data your UFs and other clients are sending to your HF. Splunk tries to close those connections cleanly before it stops, so the time it takes differs case by case.

One way to shorten this time is to put the HF into detention mode first, to prevent it from receiving any new connections. But this is probably not worth it?
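As a sketch of that idea, one could stop the receiving port before restarting so that no new forwarder connections arrive during shutdown. This assumes the `splunk enable listen` / `splunk disable listen` CLI commands (exact syntax may vary by Splunk version); the binary path, drain window, and function name are my own placeholders:

```shell
# Hypothetical drain-then-restart sequence for a single HF.
# SPLUNK_BIN and DRAIN_SECS are assumptions -- adjust for your environment.
SPLUNK_BIN="${SPLUNK_BIN:-/opt/splunk/bin/splunk}"
DRAIN_SECS="${DRAIN_SECS:-30}"

drain_and_restart() {
  "$SPLUNK_BIN" disable listen 9997   # stop accepting new forwarder connections
  sleep "$DRAIN_SECS"                 # let in-flight connections finish
  "$SPLUNK_BIN" restart               # shutdown now has far fewer connections to quiesce
  "$SPLUNK_BIN" enable listen 9997    # resume receiving after restart
}
```

Restarting HFs one at a time this way, while the UFs fail over to the remaining HFs in their outputs.conf server list, would keep the farm as a whole receiving data.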

r. Ismo


tomasztomasz
Loves-to-Learn

Thanks @isoutamo! I came to the same conclusion: it really depends on how many connections an HF has to deal with during the shutdown procedure. What worries me is that when a deployment is due and all HFs need to restart, the long shutdown procedure makes the HF farm unavailable for UFs to send data to (for as long as 4-5 minutes). Ideally, I would like to restart the HFs as quickly as possible.

I had never heard of detention mode. I will search for the term in the Splunk docs.
