The Splunk Plug-in for Jenkins has been configured to send data to the HEC (HTTP Event Collector) on a Splunk server, using a token and the jenkins_hec type.
A connection is made, data is sent, and eventually the plug-in hangs. Similar to question 479707, it is observed that there is a blocked queue on the Splunk server:
metrics.log.5:05-05-2017 21:10:56.527 -0400 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=150, largest_size=150, smallest_size=0
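To check whether the block really was a one-off, the metrics.log entries can be summarised per queue instead of being read line by line. This is a minimal sketch: `blocked_queues` is a hypothetical helper name, and it assumes the `group=queue, name=..., blocked=true` layout shown in the line above.

```shell
# Hypothetical helper: summarise blocked-queue events per queue name.
# Reads metrics.log-style lines on stdin and prints "<queue> <count>".
blocked_queues() {
  grep 'group=queue' \
    | grep 'blocked=true' \
    | sed -n 's/.*name=\([^,]*\),.*/\1/p' \
    | sort | uniq -c \
    | awk '{print $2, $1}'
}

# Example (run on the Splunk server; path assumes a default install):
#   cat "$SPLUNK_HOME"/var/log/splunk/metrics.log* | blocked_queues
```

A count of 1 for indexqueue would match the observation that the queue was blocked only for that single interval.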
Shortly after, the Jenkins server begins logging that the Splunk server is busy:
May 05, 2017 10:04:34 PM com.splunk.splunkjenkins.utils.LogConsumer handleRetry
WARNING: Server is busy, maybe caused by blocked queue, please check https://wiki.splunk.com/Community:TroubleshootingBlockedQueues, will wait 30 seconds and retry
The indexqueue is no longer blocked, and it was only blocked for that one instance. However, Jenkins will not reconnect and continues to throw errors.
In the Jenkins configuration, the "test connection" button yields:
token:AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE response:Service Unavailable
However, a curl check from the Jenkins server to the HEC listener on the Splunk server succeeds:
-sh-4.2$ curl -k "https://host:8088/services/collector?channel=AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE" -H 'Authorization: Splunk AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE' -d '{"index": "jenkins", "sourcetype":"access", "source":"/var/log/access.log", "event": {"message":"Access log test message"}}'
{"text":"Success","code":0,"ackId":2}-sh-4.2$
The only observable condition is that the Jenkins server appears to hold 8 simultaneous connections to the Splunk HEC server, even when the Jenkins plug-in is disabled and HEC has been disabled:
-sh-4.2$ netstat -an | grep 8088
tcp 0 0 0.0.0.0:8088 0.0.0.0:* LISTEN
tcp 0 0 1.2.3.4:8088 1.2.3.5:38114 ESTABLISHED
tcp 0 0 1.2.3.4:8088 1.2.3.5:44826 ESTABLISHED
tcp 0 0 1.2.3.4:8088 1.2.3.5:44826 ESTABLISHED
tcp 0 0 1.2.3.4:8088 1.2.3.5:44510 ESTABLISHED
tcp 0 0 1.2.3.4:8088 1.2.3.5:46606 ESTABLISHED
tcp 0 0 1.2.3.4:8088 1.2.3.5:47516 ESTABLISHED
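Counting the connections per client is less error-prone than reading the netstat listing by eye. The following is a rough sketch, assuming the Linux `netstat -an` column layout shown above and IPv4 addresses; `hec_connections` is a hypothetical helper name.

```shell
# Hypothetical helper: count ESTABLISHED connections to a local port, per peer IP.
# Reads `netstat -an` output on stdin; the port defaults to 8088 (HEC).
# Assumes IPv4 "addr:port" columns, as in the listing above.
hec_connections() {
  port="${1:-8088}"
  awk -v port="$port" '
    $6 == "ESTABLISHED" && $4 ~ (":" port "$") {
      split($5, peer, ":")      # peer[1] = remote IP, peer[2] = remote port
      count[peer[1]]++
    }
    END { for (ip in count) print ip, count[ip] }
  '
}

# Example (run on the Splunk server):
#   netstat -an | hec_connections 8088
```

Running this before and after disabling the plug-in would show whether the connection count from the Jenkins host actually drops.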
It appears as if the connector cycles through its "Retries on Error" (which I have set to 3) and then ceases to try to connect even when the connector is available again.
Is there a setting or other configuration that can be applied which will allow:
a) Jenkins plug-in to release its HEC connections on the Splunk server (i.e., a timeout) during a failed state
b) Flexibility to pause / resume or retry once it observes a blocked queue (as these are temporary)
Increasing the Splunk Forwarder max buffer size isn't the issue here. The only recourse appears to be restarting Jenkins, after which the same sequence repeats.
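To make the behaviour asked for in (b) concrete, here is a minimal shell sketch of "pause, probe, resume". It is not the plug-in's code, only an illustration of the desired retry loop; `retry_until_ok` and its arguments are hypothetical, and the curl probe in the comment reuses the placeholder host and token from the check above.

```shell
# Hypothetical helper: run a probe command until it succeeds,
# pausing between attempts instead of giving up for good.
# Usage: retry_until_ok MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry_until_ok() {
  max_attempts="$1"; delay="$2"; shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      echo "ok after $attempt attempt(s)"
      return 0
    fi
    echo "attempt $attempt failed" >&2
    # Sleep only if another attempt follows.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  echo "giving up after $max_attempts attempts" >&2
  return 1
}

# Example probe (sends a small test event; -f makes curl fail on HTTP errors such as 503):
#   retry_until_ok 10 30 curl -ksf -o /dev/null \
#     "https://host:8088/services/collector" \
#     -H 'Authorization: Splunk AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE' \
#     -d '{"event": "probe"}'
```

The point of the sketch is that the sender keeps probing at a fixed interval and resumes as soon as the collector answers, rather than stopping after a fixed number of retries.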