Deployment Architecture

Splunk shutdown procedure delayed by TcpInputProc

tomasztomasz
Loves-to-Learn

We have noticed that during the HFW shutdown procedure (e.g. triggered by a Parsing app deployment) there is a sequence of events that appears to close the active incoming TCP connections. An example follows:

 

09-08-2020 15:12:38.606 +0100 INFO  TcpInputProc - Running shutdown level 1. Closing listening ports.
09-08-2020 15:12:38.606 +0100 INFO  TcpInputProc - Done setting shutdown in progress signal.
09-08-2020 15:12:38.606 +0100 INFO  TcpInputProc - Shutting down listening ports
09-08-2020 15:12:38.606 +0100 INFO  TcpInputProc - Stopping IPv4 port 9997
09-08-2020 15:12:38.606 +0100 INFO  TcpInputProc - Setting up input quiesce timeout for : 90.000 secs
09-08-2020 15:12:38.942 +0100 INFO  TcpInputProc - Waiting for connection from src=172.18.18.185:64536, 172.30.194.1:58219, 172.16.57.76:49451, 172.16.218.6:52112, 172.30.50.1:34143, 172.16.36.20:50702, 172.18.13.28:39612, 172.30.66.2:47563, 172.16.57.79:54330, 172.16.165.70:57168 ...  to close before shutting down TcpInputProcessor.
...
09-08-2020 15:14:19.103 +0100 WARN  TcpInputProc - Could not process data received from network. Aborting due to shutdown
09-08-2020 15:14:20.123 +0100 WARN  TcpInputProc - Could not process data received from network. Aborting due to shutdown
09-08-2020 15:14:21.138 +0100 WARN  TcpInputProc - Could not process data received from network. Aborting due to shutdown
09-08-2020 15:14:22.172 +0100 WARN  TcpInputProc - Could not process data received from network. Aborting due to shutdown
Now, what worries me is that the number of "TcpInputProc - Could not process data received from network. Aborting due to shutdown" events can vary from around 20 (which in total takes 15-20 seconds) to hundreds (which can take as long as 4 minutes). The more of those events there are, the longer the shutdown procedure takes and the longer the HFW remains inactive (unable to process data). Eventually, the shutdown is forced after the default 360 seconds.
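For anyone wanting to quantify this per restart, a minimal sketch that counts the abort events in splunkd.log (the helper name is mine, and the log path in the example is the default install location, so adjust for your environment):

```shell
# Count the "Aborting due to shutdown" WARN events in a splunkd.log (or an excerpt).
count_shutdown_aborts() {
  # $1 = path to splunkd.log
  grep -c 'Aborting due to shutdown' "$1"
}

# Example (default path, adjust as needed):
# count_shutdown_aborts /opt/splunk/var/log/splunk/splunkd.log
```

Comparing this count across restarts, together with the timestamps of the first and last abort event, gives a rough measure of how long each quiesce phase lasted.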

Questions:
1. Why do we see a different number of "TcpInputProc - Could not process data received from network. Aborting due to shutdown" events on different occasions?
2. Is there any way to limit them and generally speed up the shutdown procedure? Maybe there is some tuning we can do on the HFW nodes?


isoutamo
SplunkTrust

The number of those events depends on how much data your UFs and other clients are sending to your HF. Splunk tries to close those connections cleanly before it stops, so the time it takes differs case by case.

One way to shorten this time is to put the HF into detention mode first, to prevent it from receiving any new connections. But this is probably not worth it?
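As a sketch of that idea, one could stop the receiving port before restarting so that no new forwarder connections arrive during shutdown. This assumes the `splunk enable listen` / `splunk disable listen` CLI commands (exact syntax may vary by Splunk version); the binary path, drain window, and function name are my own placeholders:

```shell
# Hypothetical drain-then-restart sequence for a single HF.
# SPLUNK_BIN and DRAIN_SECS are assumptions -- adjust for your environment.
SPLUNK_BIN="${SPLUNK_BIN:-/opt/splunk/bin/splunk}"
DRAIN_SECS="${DRAIN_SECS:-30}"

drain_and_restart() {
  "$SPLUNK_BIN" disable listen 9997   # stop accepting new forwarder connections
  sleep "$DRAIN_SECS"                 # let in-flight connections finish
  "$SPLUNK_BIN" restart               # shutdown now has far fewer connections to quiesce
  "$SPLUNK_BIN" enable listen 9997    # resume receiving after restart
}
```

Restarting HFs one at a time this way, while the UFs fail over to the remaining HFs in their outputs.conf server list, would keep the farm as a whole receiving data.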

r. Ismo


tomasztomasz
Loves-to-Learn

Thanks @isoutamo! I came to the same conclusion: it really depends on how many connections an HF has to deal with during the shutdown procedure. What worries me is that when a deployment is due and all HFs need to restart, the long shutdown procedure makes the HF farm unavailable for UFs to send data to (for as long as 4-5 minutes). Ideally, I would like to restart the HFs as quickly as possible.

I had never heard of detention mode. I will search for the term in the Splunk docs.
