
Why does the Splunk Plugin for Jenkins hang and fail to reconnect?

jb_spelunker
Explorer

The Splunk Plugin for Jenkins has been configured to send data to the Splunk server's HEC (HTTP Event Collector) using a token and the jenkins_hec type.

A connection is made and data is sent, but eventually the plugin hangs. Similar to question 479707, a blocked queue is observed on the Splunk server:

metrics.log.5:05-05-2017 21:10:56.527 -0400 INFO  Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499, current_size=150, largest_size=150, smallest_size=0
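
For reference, a quick way to check whether the blocking recurs is to grep the indexer's metrics.log for blocked-queue events (a sketch, assuming a default $SPLUNK_HOME install path):

# on the Splunk server: list blocked-queue events across current and rotated metrics logs
grep "blocked=true" $SPLUNK_HOME/var/log/splunk/metrics.log* | grep "group=queue"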

Shortly after, the Jenkins server begins logging that the Splunk server is busy:

May 05, 2017 10:04:34 PM com.splunk.splunkjenkins.utils.LogConsumer handleRetry
WARNING: Server is busy, maybe caused by blocked queue, please check https://wiki.splunk.com/Community:TroubleshootingBlockedQueues, will wait 30 seconds and retry

The indexqueue is no longer blocked, and it was only blocked for that one instance. However, Jenkins will not reconnect and continues to throw errors.

In the Jenkins configuration, the "Test Connection" button yields:
token:AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE response:Service Unavailable

However a curl check from the Jenkins server to the HEC listener on the Splunk server succeeds:

-sh-4.2$ curl -k "https://host:8088/services/collector?channel=AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE" -H 'Authorization: Splunk AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE' -d '{"index": "jenkins", "sourcetype":"access", "source":"/var/log/access.log", "event": {"message":"Access log test message"}}'
{"text":"Success","code":0,"ackId":2}-sh-4.2$

The only observable condition is that the Jenkins server appears to hold 8 simultaneous connections to the Splunk HEC port, even when the Jenkins plugin is disabled and HEC has been disabled:

-sh-4.2$ netstat -an | grep 8088
tcp        0      0 0.0.0.0:8088            0.0.0.0:*               LISTEN
tcp        0      0 1.2.3.4:8088          1.2.3.5:38114       ESTABLISHED
tcp        0      0 1.2.3.4:8088          1.2.3.5:44826       ESTABLISHED
tcp        0      0 1.2.3.4:8088          1.2.3.5:44510       ESTABLISHED
tcp        0      0 1.2.3.4:8088          1.2.3.5:46606       ESTABLISHED
tcp        0      0 1.2.3.4:8088          1.2.3.5:47516       ESTABLISHED
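
To see what the owning threads of those lingering connections are doing on the Jenkins side, a JVM thread dump can help (a sketch, assuming the JDK's jstack tool is installed and JENKINS_PID holds the Jenkins process ID; the splunkins-worker thread name comes from the plugin logs later in this thread):

# dump all Jenkins JVM threads and inspect the Splunk plugin's worker threads
jstack "$JENKINS_PID" > /tmp/jenkins-threads.txt
grep -A 20 "splunkins-worker" /tmp/jenkins-threads.txt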

It appears that the plugin cycles through its "Retries on Error" setting (which I have set to 3) and then stops trying to connect, even after the collector becomes available again.

Is there a setting or other configuration that can be applied which will allow:
a) Jenkins plug-in to release its HEC connections on the Splunk server (i.e., a timeout) during a failed state
b) Flexibility to pause / resume or retry once it observes a blocked queue (as these are temporary)

Increasing the Splunk forwarder's max buffer size isn't the issue here. The only recourse appears to be restarting Jenkins, after which the cycle above repeats.


jb_spelunker
Explorer

This issue has been resolved; the plugin has now been running for several weeks without any issues.

The solution involves two parts:

1) Modify the "Splunk for Jenkins Configuration" on the Jenkins server to include custom Groovy script code to place limits on what is sent:

//send job metadata and junit reports with page size set to 50 (each event contains max 50 test cases)
sendTestReport(50)
//send coverage, each event contains max 50 class metrics
sendCoverageReport(50)
//send all logs from the workspace to Splunk, with each file limited to 10MB
archive("**/*.log", null, false, "10MB")

2) Modify the Jenkins HTTP Event Collector token and uncheck the "Enable indexer acknowledgement" option.

This prevents the dying threads on the Jenkins server from respawning into an exponential retry catastrophe: 4 threads became 8, then 16, 32, 64, 128, and so on, eventually logging connection failures 600+ times per second on the Jenkins server, at which point the threads died off and were not restarted by the Jenkins master JVM.
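
For reference, the checkbox in step 2 appears to map to the useACK setting in the HEC token's stanza in inputs.conf on the Splunk server, so the same change can be made in configuration (a sketch; the [http://jenkins] token name and the app path are hypothetical for this install):

# $SPLUNK_HOME/etc/apps/splunk_httpinput/local/inputs.conf
[http://jenkins]
token = AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE
useACK = 0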

jessicawelch
Engager

Hi jb_spelunker - were you able to find a solution? We are seeing the exact same issues you reported.


GSolasa
New Member

I am facing a similar issue where I can see some warnings and errors, which also caused a Jenkins shutdown. Is there a way we can prevent it from happening in the future?

23-Apr-2018 10:14:13.433 WARNING [splunkins-worker-1] com.splunk.splunkjenkins.utils.LogConsumer.handleRetry Server is busy, maybe caused by blocked queue, please check https://wiki.splunk.com/Community:TroubleshootingBlockedQueues, will wait 30 seconds and retry
23-Apr-2018 10:14:13.111 WARNING [splunkins-worker-2] com.splunk.splunkjenkins.utils.LogConsumer.handleRetry Server is busy, maybe caused by blocked queue, please check https://wiki.splunk.com/Community:TroubleshootingBlockedQueues, will wait 30 seconds and retry
23-Apr-2018 10:14:36.362 SEVERE [http-nio-8080-Acceptor-0] hudson.init.impl.InstallUncaughtExceptionHandler$DefaultUncaughtExceptionHandler.uncaughtException A thread (http-nio-8080-Acceptor-0/72) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code.
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.tomcat.util.net.SocketBufferHandler.<init>(SocketBufferHandler.java:41)
at org.apache.tomcat.util.net.NioEndpoint.setSocketOptions(NioEndpoint.java:375)
at org.apache.tomcat.util.net.NioEndpoint$Acceptor.run(NioEndpoint.java:473)
at java.lang.Thread.run(Thread.java:745)
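
The OutOfMemoryError above suggests the Jenkins JVM ran out of heap while the plugin's retries piled up; one mitigation (a sketch, assuming Jenkins is deployed under Tomcat as the http-nio-8080 thread name suggests, and that the sizes shown fit your host) is to raise the heap in Tomcat's setenv.sh in addition to addressing the blocked queue itself:

# $CATALINA_HOME/bin/setenv.sh (hypothetical sizes; tune for your host)
export CATALINA_OPTS="$CATALINA_OPTS -Xms1g -Xmx4g"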


txiao_splunk
Splunk Employee

Are you using round-robin DNS for load balancing? If that is the case, you may need to tune DNS caching, since Jenkins runs on Java and Java caches DNS entries forever (see "java dns caching").
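
A minimal sketch of tuning the JVM DNS cache for a Jenkins instance started via JAVA_OPTS (the 60-second TTL is an arbitrary example; adjust to your environment):

# cache successful DNS lookups for 60 seconds instead of relying on the JVM default
export JAVA_OPTS="$JAVA_OPTS -Dsun.net.inetaddr.ttl=60"
# or set networkaddress.cache.ttl=60 in $JAVA_HOME/jre/lib/security/java.security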
