We have a 6-node Search Head Cluster (version 6.6.5). Since moving to SHC, scheduled searches are either failing delegation, being skipped, or stuck in Continued status.
All of the searches that are being skipped report the error: "The maximum number of concurrent running jobs for this historical scheduled search on this cluster has been reached".
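As a rough way to gauge how many executions are being skipped for this reason, you can count the matching scheduler entries on each search head. This assumes the default log location, and the exact reason string may vary slightly between versions:
grep -c "The maximum number of concurrent running jobs for this historical scheduled search" $SPLUNK_HOME/var/log/splunk/scheduler.log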
Splunk is deployed in the Azure cloud.
It looks like the Azure SNAT process drops connection entries once a connection has been idle for 4 minutes. The termination is not graceful: no TCP Reset is sent to the connection endpoints. This behavior matches the problematic TCP streams in the tcpdumps we reviewed for the mongod port 8191, across multiple search heads.
The issue occurs only when the VM kernel's TCP keep-alive timeout is higher than the Azure TCP idle timeout: by the time the VM stack starts sending keep-alives, SNAT has already deleted the connection context. From the packet captures, the Splunk server instances are configured with a TCP keep-alive timeout of 300 seconds, so the first keep-alive is sent only after 300 seconds.
This confirms that the VM kernel's TCP keep-alive timeout is higher than the idle timeout applied at the VM public IP.
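To check the current kernel keep-alive settings on each search head (standard Linux sysctls):
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes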
To resolve this issue we changed net.ipv4.tcp_keepalive_time to 180, so keep-alives start before the Azure 4-minute idle timeout expires.
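A minimal sketch of applying that change on each search head, assuming a Linux VM with root access (the sysctl.d file name is just an example):
sysctl -w net.ipv4.tcp_keepalive_time=180
echo "net.ipv4.tcp_keepalive_time = 180" >> /etc/sysctl.d/90-splunk-keepalive.conf
sysctl -p /etc/sysctl.d/90-splunk-keepalive.conf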
Here are some observations and the resolution for this issue.
We noticed KV store errors in mongod.log and splunkd.log, so we enabled DEBUG logging, which showed the failures happening during deletes of _ACCELERATE_ saved search history records.
To debug this issue we enabled DEBUG for the following channels:
$SPLUNK_HOME/bin/splunk set log-level KVStorageProvider -level DEBUG
$SPLUNK_HOME/bin/splunk set log-level KVStoreConfigurationProvider -level DEBUG
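Once the investigation is complete, set these channels back to reduce log volume (INFO is used here only as an example level):
$SPLUNK_HOME/bin/splunk set log-level KVStorageProvider -level INFO
$SPLUNK_HOME/bin/splunk set log-level KVStoreConfigurationProvider -level INFO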
To increase debugging for mongod we set the following in server.conf:
[kvstore]
verbose = true
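This stanza goes in server.conf on each search head (for example $SPLUNK_HOME/etc/system/local/server.conf; the exact location depends on how your configuration is layered), and splunkd needs a restart to pick it up:
$SPLUNK_HOME/bin/splunk restart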
With the debug logging enabled we were able to trace the timeouts to their originating actions, which, as expected, were deletions of _ACCELERATE_ searches from mongod via SavedSearchHistory:
06-06-2018 09:46:30.264 +0000 DEBUG KVStorageProvider - Delete data for key='nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_'
06-06-2018 09:46:30.264 +0000 DEBUG KVStorageProvider - Internal read query='{ "_key" : "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_", "_user" : "nobody" }' limit='1' skip='0' fields='{ "_id" : 0 }' sort='{ }'
06-06-2018 09:46:30.285 +0000 DEBUG KVStorageProvider - Internal read for query='{ "_key" : "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_", "_user" : "nobody" }' found data='[ { "runtimes" : [ 1.291 ], "sched_time" : 1528277400, "_user" : "nobody", "_key" : "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_" } ]'
06-06-2018 09:46:30.285 +0000 DEBUG KVStorageProvider - Delete data with query='{ "_key" : "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_", "_user" : "nobody" }'
2018-06-06T09:46:30.286Z D QUERY [conn1249] Running query: query: { _key: "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_", _user: "nobody" } sort: {} projection: { _id: 0 } skip: 0 limit: 1
2018-06-06T09:46:30.286Z D QUERY [conn1249] Only one plan is available; it will be run but will not be cached. query: { _key: "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_", _user: "nobody" } sort: {} projection: { _id: 0 } skip: 0 limit: 1, planSummary: IXSCAN { _user: 1, _key: 1 }
2018-06-06T09:46:30.286Z I COMMAND [conn1249] query s_systemMX8edh8vqo2ngaR2K53MLFyt_SavedSrC2W6YtGw4V2Yg7Myvf8S3GC.c query: { $query: { _key: "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_", _user: "nobody" }, $orderby: {} } planSummary: IXSCAN { _user: 1, _key: 1 } ntoreturn:1 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:208 locks:{ Global: { acquireCount: { r: 2 } }, MMAPV1Journal: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { R: 1 } } } 0ms
06-06-2018 09:51:30.286 +0000 ERROR SavedSearchHistory - db_delete( "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_" ) failed: An error occurred during the last operation ('deleteData', domain: '2', code: '4'): Failed to send "delete" command with database "s_systemMX8edh8vqo2ngaR2K53MLFyt_SavedSrC2W6YtGw4V2Yg7Myvf8S3GC": Failed to read 36 bytes from socket within 300000 milliseconds.
06-06-2018 09:51:30.286 +0000 ERROR KVStorageProvider - An error occurred during the last operation ('deleteData', domain: '2', code: '4'): Failed to send "delete" command with database "s_systemMX8edh8vqo2ngaR2K53MLFyt_SavedSrC2W6YtGw4V2Yg7Myvf8S3GC": Failed to read 36 bytes from socket within 300000 milliseconds.
The splunkd.log also showed MongoDriver messages like the below:
06-06-2018 10:18:13.457 +0000 WARN MongoDriver - stream: Failure to buffer 36 bytes: Failed to buffer 36 bytes within 300000 milliseconds.
06-06-2018 10:18:13.458 +0000 ERROR KVStorageProvider - An error occurred during the last operation ('deleteData', domain: '2', code: '4'): Failed to send "delete" command with database "s_systemMX8edh8vqo2ngaR2K53MLFyt_SavedSrC2W6YtGw4V2Yg7Myvf8S3GC": Failed to read 36 bytes from socket within 300000 milliseconds.
The network trace showed certain TCP streams with many retransmissions and repeated duplicate ACKs, even though packets appeared to be missing. We could correlate these streams directly to the mongo errors seen in splunkd.log.
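To collect the same kind of trace on a search head, something like the following tcpdump can be used (the interface name and output path are assumptions for your environment):
tcpdump -i eth0 -s 0 -w /tmp/kvstore_8191.pcap 'tcp port 8191'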