We have noticed that during the HFW shutdown procedure (e.g., triggered by a parsing app deployment) there is a sequence of events that appears to close the active incoming TCP connections. An example follows:
09-08-2020 15:12:38.606 +0100 INFO TcpInputProc - Running shutdown level 1. Closing listening ports.
09-08-2020 15:12:38.606 +0100 INFO TcpInputProc - Done setting shutdown in progress signal.
09-08-2020 15:12:38.606 +0100 INFO TcpInputProc - Shutting down listening ports
09-08-2020 15:12:38.606 +0100 INFO TcpInputProc - Stopping IPv4 port 9997
09-08-2020 15:12:38.606 +0100 INFO TcpInputProc - Setting up input quiesce timeout for : 90.000 secs
09-08-2020 15:12:38.942 +0100 INFO TcpInputProc - Waiting for connection from src=172.18.18.185:64536, 172.30.194.1:58219, 172.16.57.76:49451, 172.16.218.6:52112, 172.30.50.1:34143, 172.16.36.20:50702, 172.18.13.28:39612, 172.30.66.2:47563, 172.16.57.79:54330, 172.16.165.70:57168 ... to close before shutting down TcpInputProcessor.
...
09-08-2020 15:14:19.103 +0100 WARN TcpInputProc - Could not process data received from network. Aborting due to shutdown
09-08-2020 15:14:20.123 +0100 WARN TcpInputProc - Could not process data received from network. Aborting due to shutdown
09-08-2020 15:14:21.138 +0100 WARN TcpInputProc - Could not process data received from network. Aborting due to shutdown
09-08-2020 15:14:22.172 +0100 WARN TcpInputProc - Could not process data received from network. Aborting due to shutdown
Now, what worries me is that the number of "TcpInputProc - Could not process data received from network. Aborting due to shutdown" events can vary from around 20 (which in total takes 15-20 seconds) to hundreds (which can take as long as 4 minutes). The more of those events there are, the longer the shutdown procedure takes and the longer the HFW remains inactive (unable to process data). Eventually, the shutdown is forced after the default 360 seconds.
Questions:
1. Why do we see a different number of "TcpInputProc - Could not process data received from network. Aborting due to shutdown" events on different occasions?
2. Is there any way of limiting them and generally speeding up the shutdown procedure? Maybe there is some tuning we can do on the HFW nodes?
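As a rough model of question 1: shutdown time grows with the number of open connections until the quiesce timeout caps it, which would explain why busier HFs take longer. This is an illustrative sketch only, not Splunk's actual code; the drain rate is a made-up parameter:

```python
def quiesce(active_connections, quiesce_timeout=90.0, drain_rate=5.0):
    """Return the simulated seconds spent draining `active_connections`,
    assuming `drain_rate` connections close cleanly per second, capped at
    `quiesce_timeout` (after which the rest would be aborted)."""
    needed = active_connections / drain_rate
    return min(needed, quiesce_timeout)

# More open connections -> longer shutdown, until the timeout caps it.
print(quiesce(50))    # drains in 10.0 s
print(quiesce(1000))  # capped at the 90 s quiesce timeout
```

Under this toy model, the variability between restarts is simply the variability in how many forwarder connections happen to be open at shutdown time.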
The number of those events depends on how much your UFs and other clients are sending to your HF. Splunk tries to close those connections cleanly before it stops, so based on the events it takes a different amount of time case by case.
One way to shorten this time is to put the HF into a detention-like state first, to prevent it from receiving any new connections. But probably this is not worth it?
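A rough sketch of that idea (draining a HF before restarting it). The `disable listen` / `enable listen` CLI commands, the credentials, and the 60-second drain window are assumptions; verify the exact commands against your Splunk version's CLI reference before using:

```shell
# Stop accepting new forwarder connections on the receiving port
# (existing connections keep draining).
$SPLUNK_HOME/bin/splunk disable listen 9997 -auth admin:changeme

# Give in-flight connections a moment to close cleanly, then restart.
sleep 60
$SPLUNK_HOME/bin/splunk restart

# Re-enable receiving once the HF is back up.
$SPLUNK_HOME/bin/splunk enable listen 9997 -auth admin:changeme
```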
r. Ismo
Thanks @isoutamo! I came to the same conclusion that it really depends on how many connections a HF has to deal with during the shutdown procedure. What I am worried about is that when a deployment is due and all HFs need to restart, the long shutdown procedure makes the HF farm unavailable for UFs to send data to (for as long as 4-5 minutes). Ideally, I would like to restart the HFs as quickly as possible.
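For what it's worth, the impact on the UFs during a rolling restart can also be reduced on the client side with automatic load balancing in outputs.conf, so each UF fails over to another HF while one is restarting. A sketch (host names are placeholders; `server` and `autoLBFrequency` are standard outputs.conf settings):

```
[tcpout]
defaultGroup = hf_farm

[tcpout:hf_farm]
# UFs rotate among these receivers; if one HF is restarting,
# the UF switches to another one in the list.
server = hf1.example.com:9997, hf2.example.com:9997, hf3.example.com:9997
autoLBFrequency = 30
```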
Never heard about the detention mode. I will search for the term in Splunk docs.