Symptoms and tests to confirm
The entire cluster becomes unstable, with the Cluster Master showing indexers flapping between Up and Down. The environment has a farm of two-layer proxy servers between the indexers and the S3 remote store.
You will see intermittent HTTP errors when uploading to the SmartStore remote store:
10-07-2019 15:13:42.821 +0100 ERROR RetryableClientTransaction - transactionDone(): groupId=(nil) rTxnId=… transactionId=…. success=N HTTP-statusCode=502 HTTP-statusDescription="network error" retries=0 retry=N no_retry_reason="no retry policy" remainingTxns=0
10-07-2019 15:13:42.821 +0100 ERROR CacheManager - action=upload, cache_id="bid|_internal~….|", status=failed, unable to check if receipt exists at path=_internal/db/…/receipt.json(0,-1,), error="network error"
10-07-2019 15:13:42.821 +0100 ERROR CacheManager - action=upload, cache_id="bid|_internal~…|", status=failed, elapsed_ms=15016
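One way to spot these errors on a peer is to search splunkd.log for the two components shown above (a minimal sketch; the path assumes a default $SPLUNK_HOME):
grep -E 'ERROR (RetryableClientTransaction|CacheManager)' $SPLUNK_HOME/var/log/splunk/splunkd.log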
Crash logs contain:
[build 7651b7244cf2] 2019-10-07 11:17:36
Received fatal signal 6 (Aborted).
Cause:
Signal sent by PID 2599 running under UID 0.
Crashing thread: cachemanagerUploadExecutorWorker-180
Testing: ./splunk cmd splunkd rfs -- ls --starts-with volume:XXXXXXX returns no results because of a connection timeout with Bad Gateway (HTTP 502).
Testing: wget against the AWS S3 endpoint returns Bad Gateway.
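A minimal sketch of that check, assuming the S3 endpoint seen in the logs and a hypothetical proxy host (substitute your own values):
# -S prints the server response headers, so a 502 from the proxy is visible
https_proxy=http://proxy.example.com:8080 wget -S -O /dev/null https://s3-us-west-2.amazonaws.com/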
To confirm the issue with a reproduction:
Step 1. Change the parameter values below in server.conf to 200 (a verification sketch follows after Step 2):
[cachemanager]
max_concurrent_downloads = 200
max_concurrent_uploads = 200
Step 2. Block the connection from the peers to S3 using:
echo "127.0.0.1 s3-us-west-2.amazonaws.com" >> /etc/hosts
What was observed -
1. Peers were unable to upload buckets to remote storage (which is expected, since S3 was blocked).
2. Peers constantly retried uploading the buckets.
3. Peers were marked Down by the CM because they could not heartbeat to the CM: they were constantly busy retrying bucket uploads with so many threads in parallel, which put extra pressure on the CMSlave lock.
Below is the pstack I collected from one of the indexers:
Thread holding the CMSlave lock while making an S3 HEAD request to check whether a file is present on S3:
Thread 14 (Thread 0x7f8b04dff700 (LWP 8834)):
Meanwhile, other threads, such as the heartbeat thread and threads handling other operations, are waiting for this lock to be released.
Heartbeat thread waiting for the lock:
Thread 60 (Thread 0x7f8afa7ff700 (LWP 9053)):
Even searches might be blocked on this lock -
Thread 81 (Thread 0x7f8afe1ff700 (LWP 10428)):
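For reference, a stack dump like the excerpts above can be gathered on a Linux indexer with pstack against the running splunkd process (the pgrep filter below is an assumption about how splunkd appears in the process list):
# dump the stacks of all splunkd threads; -o picks the oldest (parent) splunkd PID
pstack $(pgrep -o splunkd)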
This explains why the cluster was so unstable when bucket uploads were failing, and accounts for observations #1 and #3.
This dependency on the CMSlave lock has already been fixed in 8.0.1.
Regarding #2: because the customer set max_concurrent_downloads/max_concurrent_uploads = 200, there were so many concurrent uploads to S3 through the proxy that the proxy became overloaded and requests started backing up. At some point the proxy closed the connections to the indexers, upload retries began, and timeouts appeared.
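As a rough, hypothetical illustration of the scale involved: with 30 indexer peers each permitted 200 concurrent uploads, the proxy layer could face up to 30 x 200 = 6,000 simultaneous S3 connections at peak, far more than it would see with the default concurrency settings.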
After applying 8.0.1, we verified using the tests above that the cluster no longer blocks, as the dependency on the CMSlave lock has been fixed.
Solution: Apply maintenance release 8.0.1
The fix is covered in JIRA SPL-177646: the dependency on the CMSlave lock has been fixed.
Description: Splunk is unable to upload to S3 with SmartStore through a proxy.
Reported as a SmartStore-enabled indexer cluster with problems during upload to S3; the S2-enabled indexer cluster showed both upload and download failures.
Root cause:
max_concurrent_uploads and max_concurrent_downloads in server.conf were changed to unusually high values, which put pressure on the proxy layer between the indexers and S3.
There were too many connections to S3 from the indexers (via the proxy server), and Splunk core (a deficiency in the CMSlave lock algorithm) could not handle it. As a result, upload/download requests could not proceed and uploads eventually failed.
The sudden increase in uploads/downloads led to a throttling effect at the proxy between the indexers and the S3 storage system and overloaded Splunk core; this backed up (and timed out) on the indexer side, preventing downloads and uploads.
Workaround (if you cannot migrate to 8.0.1 straight away)
Setting max_concurrent_uploads and max_concurrent_downloads back to their default values helped.
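A minimal sketch of the workaround, assuming the overrides live in the peers' local server.conf: remove or comment out the explicit values so the shipped defaults apply again, then restart each peer and re-check the effective values with the same btool command shown earlier.
[cachemanager]
# remove or comment out the overrides so the defaults from
# $SPLUNK_HOME/etc/system/default/server.conf take effect
# max_concurrent_downloads = 200
# max_concurrent_uploads = 200
./splunk restart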