Using a proxy with S3 storage causes the cluster to become unstable, with indexers flapping between up and down

Splunk Employee

Symptoms and tests to confirm
The entire cluster becomes unstable, with the Cluster Master showing indexers flapping between up and down. The environment uses a farm of two-layer proxy servers.
You will see intermittent HTTP errors when uploading to SmartStore.

10-07-2019 15:13:42.821 +0100 ERROR RetryableClientTransaction - transactionDone(): groupId=(nil) rTxnId=… transactionId=…. success=N HTTP-statusCode=502 HTTP-statusDescription="network error" retries=0 retry=N no_retry_reason="no retry policy" remainingTxns=0

10-07-2019 15:13:42.821 +0100 ERROR CacheManager - action=upload, cache_id="bid|_internal~….|", status=failed, unable to check if receipt exists at path=_internal/db/…/receipt.json(0,-1,), error="network error"

10-07-2019 15:13:42.821 +0100 ERROR CacheManager - action=upload, cache_id="bid|_internal~…|", status=failed, elapsed_ms=15016

Crashlogs with:

[build 7651b7244cf2] 2019-10-07 11:17:36
Received fatal signal 6 (Aborted).
Signal sent by PID 2599 running under UID 0.
Crashing thread: cachemanagerUploadExecutorWorker-180

Testing: ./splunk cmd splunkd rfs -- ls --starts-with volume:XXXXXXX returns no results because of a connection timeout with Bad Gateway (502).
Testing: wget against the AWS S3 endpoint returns Bad Gateway.

To confirm the issue with a repro

Step 1. Change the following parameter values in server.conf to 200:

max_concurrent_downloads = 200
max_concurrent_uploads = 200

Step 2. Block the connection from the peers to S3 using:
echo "" >> /etc/hosts
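For reference, a sketch of how the repro settings from Step 1 would look in server.conf. The [cachemanager] stanza is an assumption here (it is where these SmartStore settings live in recent releases; check your version's server.conf.spec):

```
[cachemanager]
# Repro values only; the shipped defaults are far lower. High concurrency
# amplifies pressure on the proxy and on the CMSlave lock.
max_concurrent_downloads = 200
max_concurrent_uploads = 200
```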

What was observed:
1. Peers were unable to upload buckets to remote storage (as expected).
2. Peers were constantly retrying the bucket uploads.
3. Peers were marked Down by the CM because they could not heartbeat to the CM: they were constantly busy retrying bucket uploads with many threads in parallel, which put extra pressure on the CMSlave lock.
Below is the pstack I collected from one of the indexers.
This thread holds the CMSlave lock while making an S3 HEAD request to check whether a file is present on S3:

Thread 14 (Thread 0x7f8b04dff700 (LWP 8834)):

0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38

1 0x00005639b0d24e27 in EventLoop::run() ()

2 0x00005639b0dece00 in TcpOutboundLoop::run() ()

3 0x00005639b08928e9 in RetryableClientTransaction::_run_sync(bool) ()

4 0x00005639b0930c44 in S3StorageInterface::fileExists(StorageObject const&, Str*, RemoteRetryPolicy*) ()

5 0x00005639b04eb4b0 in cachemanager::CacheManagerBackEnd::isRemoteBucketPresent(cachemanager::CacheId const&, Pathname const&, bool, ScopedPointer*) const ()

6 0x00005639b04f2bc1 in cachemanager::CacheManagerBackEnd::isBucketStable(cachemanager::CacheId const&, cachemanager::CacheManagerBackEnd::CheckScope, bool, ScopedPointer*) ()

7 0x00005639b03435c7 in DatabaseDirectoryManager::isBucketStable(cachemanager::CacheId const&, cachemanager::CacheManagerBackEnd::CheckScope, bool, bool, ScopedPointer*) ()

8 0x00005639b0f92f64 in CMSlave::manageReplicatedBucketsTimeoutS2_locked() ()

9 0x00005639b0f93c9d in CMSlave::service(bool) ()

10 0x00005639b00e09f3 in CallbackRunnerThread::main() ()

11 0x00005639b0dedfa9 in Thread::callMain(void*) ()

12 0x00007f8b0d9614a4 in start_thread (arg=0x7f8b04dff700) at pthread_create.c:456

13 0x00007f8b0d6a3d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

Meanwhile, other threads, such as the heartbeat thread, are waiting for this lock to be released.
Heartbeat thread waiting for the lock:

Thread 60 (Thread 0x7f8afa7ff700 (LWP 9053)):

0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135

1 0x00007f8b0d963bb5 in GI_pthread_mutex_lock (mutex=0x7f8b0d0818f8) at ../nptl/pthread_mutex_lock.c:80

2 0x00005639b0dedcd9 in PthreadMutexImpl::lock() ()

3 0x00005639b0f71f55 in CMSlave::getHbInfo(Str&, Str&, unsigned int&, CMPeerStatus::ManualDetention&, bool&, long&, unsigned long&) ()

4 0x00005639b1005b8c in CMHeartbeatThread::when_expired(Interval*) ()

5 0x00005639b0df634c in TimeoutHeap::runExpiredTimeouts(MonotonicTime&) ()

6 0x00005639b0d24d86 in EventLoop::run() ()

7 0x00005639b01225da in CMServiceThread::main() ()

8 0x00005639b0dedfa9 in Thread::callMain(void*) ()

9 0x00007f8b0d9614a4 in start_thread (arg=0x7f8afa7ff700) at pthread_create.c:456

10 0x00007f8b0d6a3d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

Even searches might be blocked on this lock:

Thread 81 (Thread 0x7f8afe1ff700 (LWP 10428)):

0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135

1 0x00007f8b0d963bb5 in GI_pthread_mutex_lock (mutex=0x7f8b0d0818f8) at ../nptl/pthread_mutex_lock.c:80

2 0x00005639b0dedcd9 in PthreadMutexImpl::lock() ()

3 0x00005639b0f948ec in CMSlave::writeBucketsToSearch(unsigned long, Clustering::SiteId const&, Clustering::SummaryAction, Str&) ()

4 0x00005639b13a0822 in DispatchCommand::dumpClusterSlaveBuckets(SearchResultsInfo&) ()

5 0x00005639b1429152 in StreamedSearchDataProvider::handleStreamConnectionImpl(HttpCompressingServerTransaction&, SearchResultsInfo*, Str*) ()

6 0x00005639b142bbb5 in StreamedSearchDataProvider::handleStreamConnection(HttpCompressingServerTransaction&) ()

7 0x00005639b0c38d4d in MHTTPStreamDataProvider::streamBody() ()

8 0x00005639b07db115 in ServicesEndpointReplyDataProvider::produceBody() ()

9 0x00005639b07d28ff in RawRestHttpHandler::getBody(HttpServerTransaction*) ()

10 0x00005639b0d558fb in HttpThreadedCommunicationHandler::communicate(TcpSyncDataBuffer&) ()

11 0x00005639b0119e42 in TcpChannelThread::main() ()

12 0x00005639b0dedfa9 in Thread::callMain(void*) ()

13 0x00007f8b0d9614a4 in start_thread (arg=0x7f8afe1ff700) at pthread_create.c:456

14 0x00007f8b0d6a3d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

This explains why the cluster was so unstable when bucket uploads failed, and accounts for observations #1 and #3.
This dependency on the CMSlave lock has already been fixed in 8.0.1.

As for #2: because the customer set max_concurrent_downloads/uploads = 200, there were so many concurrent uploads to S3 via the proxy that the proxy became saturated and started backing up. Eventually it closed the connections to the indexers, upload retries began, and timeouts appeared.
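The stack traces above can be boiled down to a small simulation: one thread holds a shared lock across a slow remote call while the heartbeat thread blocks on the same lock. This is a toy model in Python, not Splunk's actual code; the names loosely mirror the symbols in the pstack.

```python
import threading
import time

# Stand-in for the CMSlave lock from the stack traces above.
cmslave_lock = threading.Lock()
heartbeat_delays = []

def upload_worker():
    # Holds the shared lock across a slow "remote existence check", as
    # CMSlave::manageReplicatedBucketsTimeoutS2_locked() held the lock
    # across S3StorageInterface::fileExists() in the first stack trace.
    with cmslave_lock:
        time.sleep(0.5)  # stand-in for an S3 HEAD request stalled behind the proxy

def heartbeat():
    # The heartbeat thread needs the same lock (CMSlave::getHbInfo in the
    # second trace), so it blocks until the uploader releases it. If the
    # delay exceeds the CM's heartbeat timeout, the peer is marked Down.
    start = time.monotonic()
    with cmslave_lock:
        pass
    heartbeat_delays.append(time.monotonic() - start)

uploader = threading.Thread(target=upload_worker)
uploader.start()
time.sleep(0.1)  # let the uploader acquire the lock first
hb = threading.Thread(target=heartbeat)
hb.start()
uploader.join()
hb.join()
print(f"heartbeat blocked for ~{heartbeat_delays[0]:.1f}s")
```

With 200 concurrent upload workers each taking a turn on the lock during a proxy outage, the heartbeat delay compounds, which is why the peers flap on the CM.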


Splunk Employee

Applying 8.0.1 and re-running the tests above verified that it no longer blocks, as the dependency on the CMSlave lock has been fixed.

Solution: Apply maintenance release 8.0.1.
The fix is covered in JIRA SPL-177646; the dependency on the CMSlave lock has been fixed.

Description: Splunk is unable to upload to S3 with SmartStore through a proxy.
Reported as: an indexer cluster with SmartStore enabled has problems uploading to S3; an S2-enabled indexer cluster shows upload and download failures.

Root cause:
The max_concurrent_uploads/downloads settings in server.conf were changed to unusually high values, which caused pressure on the proxy between the indexers and S3.

There were too many connections to S3 from the indexers (via the proxy server), and Splunk core (a deficiency in the CMSlave lock algorithm) could not handle it. As a result, upload/download requests could not proceed and uploads eventually failed.

The sudden increase in uploads/downloads led to a throttling effect at the proxy between the indexers and S3 and overloaded Splunk core; this backed up (and timed out) on the indexer side, preventing downloads and uploads.

Workaround (if you can't migrate to 8.0.1 straight away):
Setting max_concurrent_uploads/downloads back to their default values helped.
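A sketch of the workaround in server.conf. The [cachemanager] stanza and the default value of 8 for both settings are assumptions based on recent versions; simply deleting the overrides also restores the defaults, and you should verify the actual defaults against your version's server.conf.spec:

```
[cachemanager]
# Assumed shipped defaults; confirm in server.conf.spec for your release.
max_concurrent_downloads = 8
max_concurrent_uploads = 8
```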
