Deployment Architecture

Concurrent searches are skipping with Search Head Cluster deployment on Azure

rbal_splunk
Splunk Employee

We have a 6-node Search Head Cluster (version 6.6.5). Since moving to the SHC, scheduled searches are either failing delegation, being skipped, or sitting in Continued status.

All the searches that are being skipped report the error: "The maximum number of concurrent running jobs for this historical scheduled search on this cluster has been reached"

We have Splunk deployed in the Azure cloud.


rbal_splunk
Splunk Employee

It looks like the Azure SNAT process drops connection entries once a connection has been idle for 4 minutes. This termination is not graceful: it does not send a TCP Reset to the connection endpoints. This behavior matches the problematic TCP streams in the tcpdumps reviewed for the mongod port 8191 across multiple SHs.

It seems the issue is possible if and only if the VM kernel's TCP keep-alive timeout is higher than the Azure TCP idle timeout. By the time the VM stack starts sending keep-alives, the SNAT has already deleted the connection context. From the packet captures, we see that the Splunk server instances use a TCP keep-alive timeout value of 300 seconds; the first keep-alive is sent after 300 seconds.

  1. The Azure VM public IP outbound TCP idle timeout is set to the default of 4 minutes.
  2. The VM kernel keep-alive settings are as below:

     [root@jetprodeastsh1 ~]# sysctl -a | grep net.ipv4.tcp_keepalive_time
     net.ipv4.tcp_keepalive_time = 7200
     [root@jetprodeastsh1 ~]# sysctl -a | grep net.ipv4.tcp_keepalive_intvl
     net.ipv4.tcp_keepalive_intvl = 75
     [root@jetprodeastsh1 ~]# sysctl -a | grep net.ipv4.tcp_keepalive_probes
     net.ipv4.tcp_keepalive_probes = 9

This shows that the VM kernel's TCP keep-alive timeout is higher than the idle timeout on the VM public IP.
To resolve this issue we changed net.ipv4.tcp_keepalive_time to 180.
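
A minimal sketch of applying and persisting that change on each SH (the drop-in file name is an example, not from the original post):

# apply the new keep-alive time immediately on the running kernel
sysctl -w net.ipv4.tcp_keepalive_time=180

# persist it across reboots via a sysctl drop-in file
echo "net.ipv4.tcp_keepalive_time = 180" >> /etc/sysctl.d/90-tcp-keepalive.conf
sysctl -p /etc/sysctl.d/90-tcp-keepalive.conf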


rbal_splunk
Splunk Employee

Here are some observations and the resolution for this issue.

We noticed KV store errors in both mongod.log and splunkd.log, so we enabled DEBUG logging, which showed the failures happening on deletes of _ACCELERATE_ saved search records.

To debug this issue we enabled DEBUG for:

$SPLUNK_HOME/bin/splunk set log-level KVStorageProvider -level DEBUG
$SPLUNK_HOME/bin/splunk set log-level KVStoreConfigurationProvider -level DEBUG
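
Once the investigation is complete, the same CLI can be used to drop the components back down (a sketch; INFO is assumed to be the normal level):

$SPLUNK_HOME/bin/splunk set log-level KVStorageProvider -level INFO
$SPLUNK_HOME/bin/splunk set log-level KVStoreConfigurationProvider -level INFO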

To increase debugging for mongod we set the following in server.conf:

[kvstore]
verbose = true
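
For reference, a sketch of rolling that out (the system/local path and the need for a restart are assumptions, not from the original post):

# place the stanza in $SPLUNK_HOME/etc/system/local/server.conf, then restart
$SPLUNK_HOME/bin/splunk restart

# confirm the KV store comes back up on each member afterwards
$SPLUNK_HOME/bin/splunk show kvstore-status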

With the DEBUG logs enabled we were able to trace the timeouts to their originating actions, which, as expected, were deletions of _ACCELERATE_ searches from mongo via SavedSearchHistory:

06-06-2018 09:46:30.264 +0000 DEBUG KVStorageProvider - Delete data for key='nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_'
 06-06-2018 09:46:30.264 +0000 DEBUG KVStorageProvider - Internal read query='{ "_key" : "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_", "_user" : "nobody" }' limit='1' skip='0' fields='{ "_id" : 0 }' sort='{ }'
 06-06-2018 09:46:30.285 +0000 DEBUG KVStorageProvider - Internal read for query='{ "_key" : "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_", "_user" : "nobody" }' found data='[ { "runtimes" : [ 1.291 ], "sched_time" : 1528277400, "_user" : "nobody", "_key" : "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_" } ]'
 06-06-2018 09:46:30.285 +0000 DEBUG KVStorageProvider - Delete data with query='{ "_key" : "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_", "_user" : "nobody" }'
  2018-06-06T09:46:30.286Z D QUERY    [conn1249] Running query: query: { _key: "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_", _user: "nobody" } sort: {} projection: { _id: 0 } skip: 0 limit: 1
  2018-06-06T09:46:30.286Z D QUERY    [conn1249] Only one plan is available; it will be run but will not be cached. query: { _key: "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_", _user: "nobody" } sort: {} projection: { _id: 0 } skip: 0 limit: 1, planSummary: IXSCAN { _user: 1, _key: 1 }
  2018-06-06T09:46:30.286Z I COMMAND  [conn1249] query s_systemMX8edh8vqo2ngaR2K53MLFyt_SavedSrC2W6YtGw4V2Yg7Myvf8S3GC.c query: { $query: { _key: "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_", _user: "nobody" }, $orderby: {} } planSummary: IXSCAN { _user: 1, _key: 1 } ntoreturn:1 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:208 locks:{ Global: { acquireCount: { r: 2 } }, MMAPV1Journal: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { R: 1 } } } 0ms
 06-06-2018 09:51:30.286 +0000 ERROR SavedSearchHistory - db_delete( "nobody;search;_ACCELERATE_EFE4643E-A7D0-4EB5-B688-624724C9CD14_search_xxt-admin_f11e1bbde0afa60a_ACCELERATE_" ) failed: An error occurred during the last operation ('deleteData', domain: '2', code: '4'): Failed to send "delete" command with database "s_systemMX8edh8vqo2ngaR2K53MLFyt_SavedSrC2W6YtGw4V2Yg7Myvf8S3GC": Failed to read 36 bytes from socket within 300000 milliseconds.
 06-06-2018 09:51:30.286 +0000 ERROR KVStorageProvider - An error occurred during the last operation ('deleteData', domain: '2', code: '4'): Failed to send "delete" command with database "s_systemMX8edh8vqo2ngaR2K53MLFyt_SavedSrC2W6YtGw4V2Yg7Myvf8S3GC": Failed to read 36 bytes from socket within 300000 milliseconds.

splunkd.log also showed MongoDriver messages like the below:

06-06-2018 10:18:13.457 +0000 WARN  MongoDriver - stream: Failure to buffer 36 bytes: Failed to buffer 36 bytes within 300000 milliseconds.
06-06-2018 10:18:13.458 +0000 ERROR KVStorageProvider - An error occurred during the last operation ('deleteData', domain: '2', code: '4'): Failed to send "delete" command with database "s_systemMX8edh8vqo2ngaR2K53MLFyt_SavedSrC2W6YtGw4V2Yg7Myvf8S3GC": Failed to read 36 bytes from socket within 300000 milliseconds.
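
To gauge how widespread the timeouts were, the same errors can be pulled from each member's logs with something like this (a sketch; the standard $SPLUNK_HOME log path is assumed):

# socket read timeouts reported by the KV store provider
grep "Failed to read .* bytes from socket" $SPLUNK_HOME/var/log/splunk/splunkd.log

# matching buffer failures from the mongo driver
grep "Failure to buffer" $SPLUNK_HOME/var/log/splunk/splunkd.log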

The network trace showed certain TCP streams with many retransmissions and repeated duplicate ACKs, and there appear to be missing packets. We can correlate these directly to the mongo errors seen in splunkd.log.
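
For reference, a capture along these lines on the KV store port can be taken on each SH to collect the streams for correlation (a sketch; the interface name and output file are assumptions):

tcpdump -i eth0 -s 0 -w /tmp/mongod_8191.pcap port 8191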
