Symptoms and tests to confirm
The entire cluster becomes unstable, with the Cluster Master showing indexer peers flapping between Up and Down. The environment routes traffic to S3 through a farm of two-layer proxy servers.
You will see intermittent HTTP errors when uploading to the SmartStore remote store.
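One way to confirm the flapping is to poll the Cluster Master's peers REST endpoint and tally the peer states. The endpoint path is standard, but the host, credentials, and the quick-and-dirty grep-based JSON extraction below are my own assumptions, not a supported client:

```shell
# Tally "status" values from the cluster/master/peers JSON output.
# (Crude grep-based extraction; jq would be cleaner if available.)
summarize_peer_status() {
  grep -o '"status":"[^"]*"' | sort | uniq -c
}

# Example usage (placeholder host/credentials; point this at your CM):
# curl -sk -u admin:changeme \
#   "https://cm.example.com:8089/services/cluster/master/peers?output_mode=json" \
#   | summarize_peer_status
```

Running this in a loop during the incident shows peers cycling between Up and Down counts.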
Step 2. Block the connection from the peers to S3 using:
echo "127.0.0.1 s3-us-west-2.amazonaws.com" >> /etc/hosts
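A reversible version of the same hosts-file block can make the test easier to clean up. `HOSTS_FILE` is parameterized so the steps can be rehearsed on a copy before touching the real `/etc/hosts`; the marker comment is my own convention, not anything Splunk-specific:

```shell
HOSTS_FILE="${HOSTS_FILE:-/etc/hosts}"
MARKER="# smartstore-s3-block-test"

# Point the S3 endpoint at localhost so peer uploads fail.
block_s3() {
  echo "127.0.0.1 s3-us-west-2.amazonaws.com $MARKER" >> "$HOSTS_FILE"
}

# Remove only the line we added, leaving the rest of the file intact.
unblock_s3() {
  sed -i '/smartstore-s3-block-test/d' "$HOSTS_FILE"
}
```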
What was observed:
1. Peers were unable to upload buckets to remote storage (expected, given the block).
2. Peers constantly retried the bucket uploads.
3. Peers were marked Down by the Cluster Master: they could not heartbeat to the CM because they were busy retrying the uploads on many parallel threads, which put extra pressure on the CMSlave lock.
Below is a pstack I collected from one of the indexers.
Thread holding the CMSlave lock while making an S3 HEAD request to check whether a file is present on S3:
Thread 14 (Thread 0x7f8b04dff700 (LWP 8834)):
0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
1 0x00005639b0d24e27 in EventLoop::run() ()
2 0x00005639b0dece00 in TcpOutboundLoop::run() ()
3 0x00005639b08928e9 in RetryableClientTransaction::_run_sync(bool) ()
4 0x00005639b0930c44 in S3StorageInterface::fileExists(StorageObject const&, Str*, RemoteRetryPolicy*) ()
4 0x00005639b1005b8c in CMHeartbeatThread::when_expired(Interval*) ()
5 0x00005639b0df634c in TimeoutHeap::runExpiredTimeouts(MonotonicTime&) ()
6 0x00005639b0d24d86 in EventLoop::run() ()
7 0x00005639b01225da in CMServiceThread::main() ()
8 0x00005639b0dedfa9 in Thread::callMain(void*) ()
9 0x00007f8b0d9614a4 in start_thread (arg=0x7f8afa7ff700) at pthread_create.c:456
10 0x00007f8b0d6a3d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
Even searches can be blocked on this lock:
Thread 81 (Thread 0x7f8afe1ff700 (LWP 10428)):
0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
1 0x00007f8b0d963bb5 in __GI___pthread_mutex_lock (mutex=0x7f8b0d0818f8) at ../nptl/pthread_mutex_lock.c:80
2 0x00005639b0dedcd9 in PthreadMutexImpl::lock() ()
3 0x00005639b0f948ec in CMSlave::writeBucketsToSearch(unsigned long, Clustering::SiteId const&, Clustering::SummaryAction, Str&) ()
4 0x00005639b13a0822 in DispatchCommand::dumpClusterSlaveBuckets(SearchResultsInfo&) ()
5 0x00005639b1429152 in StreamedSearchDataProvider::handleStreamConnectionImpl(HttpCompressingServerTransaction&, SearchResultsInfo*, Str*) ()
6 0x00005639b142bbb5 in StreamedSearchDataProvider::handleStreamConnection(HttpCompressingServerTransaction&) ()
7 0x00005639b0c38d4d in MHTTPStreamDataProvider::streamBody() ()
8 0x00005639b07db115 in ServicesEndpointReplyDataProvider::produceBody() ()
9 0x00005639b07d28ff in RawRestHttpHandler::getBody(HttpServerTransaction*) ()
10 0x00005639b0d558fb in HttpThreadedCommunicationHandler::communicate(TcpSyncDataBuffer&) ()
11 0x00005639b0119e42 in TcpChannelThread::main() ()
12 0x00005639b0dedfa9 in Thread::callMain(void*) ()
13 0x00007f8b0d9614a4 in start_thread (arg=0x7f8afe1ff700) at pthread_create.c:456
14 0x00007f8b0d6a3d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
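The two stacks above show the classic lock-held-across-remote-I/O pattern: one thread holds the lock for the full duration of a slow S3 call, so every other thread that needs the same lock (heartbeat processing, searches) blocks until it returns. A minimal flock(1) sketch of the failure mode, with timings and names purely illustrative:

```shell
# Illustrative only: one process holds a lock across a slow "remote call"
# (sleep 3 standing in for the S3 HEAD request), so a second process that
# needs the same lock (standing in for the heartbeat path) misses its deadline.
LOCK=$(mktemp)

# "CMSlave lock holder": grabs the lock, then blocks on the slow remote call.
flock "$LOCK" -c 'sleep 3' &
holder=$!
sleep 0.5   # give the holder time to acquire the lock

# "Heartbeat": must take the same lock within its 1-second deadline.
if flock -w 1 "$LOCK" -c true; then
  RESULT="heartbeat ok"
else
  RESULT="heartbeat missed"
fi
echo "$RESULT"
wait "$holder"
```

With the remote call taking longer than the waiter's deadline, the second acquisition times out, which is exactly what drove the peers to be marked Down.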
This explains why the cluster was so unstable when bucket uploads were failing, and accounts for observations #1 and #3.
This dependency on the CMSlave lock has already been fixed in 8.0.1.
About #2: since the customer had set max_concurrent_downloads/uploads = 200, there were so many concurrent uploads to S3 through the proxy that the proxy became overloaded and started backing up. At one point it closed the connections to the indexers, upload retries began, and timeouts appeared.
Applying 8.0.1 and re-running the tests above verified that it no longer blocks, as the dependency on the CMSlave lock has been fixed.
Solution: Apply maintenance release 8.0.1.
The fix is covered in JIRA SPL-177646; the dependency on the CMSlave lock has been fixed.
Description: Splunk is unable to upload to S3 with SmartStore through a proxy.
Reported as: a SmartStore-enabled indexer cluster with problems during upload to S3, showing both upload and download failures.
The max_concurrent_uploads/downloads settings in server.conf had been changed to unusually high values, which put pressure on the proxy layer between the indexers and S3.
There were too many connections to S3 from the indexers (via the proxy server), and Splunk core (a deficiency in the CMSlave lock algorithm) could not handle it. As a result, upload/download requests could not proceed and uploads eventually failed.
The sudden increase in uploads/downloads led to a throttling effect at the proxy between the indexers and S3, overloaded Splunk core, and backed up (and timed out) on the indexer side, preventing downloads and uploads.
Workaround (if you cannot migrate to 8.0.1 straight away):
Setting max_concurrent_uploads/downloads back to their default values helped.
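For reference, these settings live in the [cachemanager] stanza of server.conf on the indexers. A sketch of the reverted configuration, assuming the 8.0-era default of 8 for both settings (verify the defaults against your version's server.conf.spec):

```ini
[cachemanager]
# Revert from the customer's value of 200 back to the defaults
# (assumed default of 8 here; confirm for your Splunk version).
max_concurrent_uploads = 8
max_concurrent_downloads = 8
```

A restart of splunkd on each indexer is required for server.conf changes to take effect.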