We are running Splunk 6.5.1, and on one of our standalone search heads, upon every restart of splunkd we get the following message:
KV Store changed status to failed. KVStore process terminated.
KV Store process terminated abnormally (exit code 14, status exited with code 14). See mongod.log and splunkd.log for details.
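For reference, the failed state should also be visible via the standard CLI status check (run from the Splunk bin directory; path assumes a default Windows install):

cd "C:\Program Files\Splunk\bin"
splunk show kvstore-status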
And further, in mongod.log:
2017-01-27T06:57:06.836Z I NETWORK FIPS 140-2 mode activated
2017-01-27T06:57:06.894Z I CONTROL [initandlisten] MongoDB starting : pid=1852 port=8191 dbpath=C:\Program Files\Splunk\var\lib\splunk\kvstore\mongo 64-bit host=FS300
2017-01-27T06:57:06.894Z I CONTROL [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2017-01-27T06:57:06.894Z I CONTROL [initandlisten] db version v3.0.8-splunk
2017-01-27T06:57:06.894Z I CONTROL [initandlisten] git version: 83d8cc25e00e42856924d84e220fbe4a839e605d
2017-01-27T06:57:06.894Z I CONTROL [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_49
2017-01-27T06:57:06.894Z I CONTROL [initandlisten] allocator: tcmalloc
2017-01-27T06:57:06.894Z I CONTROL [initandlisten] options: { net: { port: 8191, ssl: { CAFile: "C:\Program Files\Splunk\etc\auth\mycerts\cacert.cer", FIPSMode: true, PEMKeyFile: "C:\Program Files\Splunk\etc\auth\mycerts\server.pem", PEMKeyPassword: "<password>", allowInvalidHostnames: true, mode: "preferSSL" } }, replication: { oplogSizeMB: 200, replSet: "E3DEAFEB-A8FE-4C69-8C24-6BD8AE669663" }, security: { clusterAuthMode: "sendX509", javascriptEnabled: false, keyFile: "C:\Program Files\Splunk\var\lib\splunk\kvstore\mongo\splunk.key" }, setParameter: { enableLocalhostAuthBypass: "0" }, storage: { dbPath: "C:\Program Files\Splunk\var\lib\splunk\kvstore\mongo", mmapv1: { smallFiles: true } }, systemLog: { timeStampFormat: "iso8601-utc" } }
2017-01-27T06:57:06.920Z I JOURNAL [initandlisten] journal dir=C:\Program Files\Splunk\var\lib\splunk\kvstore\mongo\journal
2017-01-27T06:57:06.920Z I JOURNAL [initandlisten] recover begin
2017-01-27T06:57:06.920Z I JOURNAL [initandlisten] recover lsn: 0
2017-01-27T06:57:06.920Z I JOURNAL [initandlisten] recover C:\Program Files\Splunk\var\lib\splunk\kvstore\mongo\journal\j._0
2017-01-27T06:57:06.934Z I JOURNAL [initandlisten] recover cleaning up
2017-01-27T06:57:06.934Z I JOURNAL [initandlisten] removeJournalFiles
2017-01-27T06:57:06.934Z I JOURNAL [initandlisten] recover done
2017-01-27T06:57:06.998Z I JOURNAL [durability] Durability thread started
2017-01-27T06:57:06.999Z I JOURNAL [journal writer] Journal writer thread started
2017-01-27T06:57:07.507Z I NETWORK [initandlisten] waiting for connections on port 8191 ssl
2017-01-27T06:57:07.509Z I REPL [ReplicationExecutor] New replica set config in use: { _id: "E3DEAFEB-A8FE-4C69-8C24-6BD8AE669663", version: 1, members: [ { _id: 0, host: "127.0.0.1:8191", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: { instance: "E3DEAFEB-A8FE-4C69-8C24-6BD8AE669663", all: "all" }, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatTimeoutSecs: 10, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 } } }
2017-01-27T06:57:07.509Z I REPL [ReplicationExecutor] This node is 127.0.0.1:8191 in the config
2017-01-27T06:57:07.509Z I REPL [ReplicationExecutor] transition to STARTUP2
2017-01-27T06:57:07.509Z I REPL [ReplicationExecutor] Starting replication applier threads
2017-01-27T06:57:07.510Z I REPL [ReplicationExecutor] transition to RECOVERING
2017-01-27T06:57:07.511Z I REPL [ReplicationExecutor] transition to SECONDARY
2017-01-27T06:57:07.512Z I REPL [ReplicationExecutor] transition to PRIMARY
2017-01-27T06:57:07.567Z I NETWORK [initandlisten] connection accepted from 127.0.0.1:50653 #1 (1 connection now open)
2017-01-27T06:57:07.784Z I ACCESS [conn1] authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "emailAddress=email@email.com,CN=searchhead.abc.com,OU=OU,O=OrgName,L=City,ST=state,C=US" }
2017-01-27T06:57:07.784Z I NETWORK [conn1] end connection 127.0.0.1:50653 (0 connections now open)
2017-01-27T06:57:08.557Z I REPL [rsSync] transition to primary complete; database writes are now permitted
2017-01-27T06:57:08.786Z I NETWORK [initandlisten] connection accepted from 127.0.0.1:50679 #2 (1 connection now open)
2017-01-27T06:57:09.027Z I ACCESS [conn2] authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "emailAddress=email@email.com,CN=searchhead.abc.com,OU=OU,O=OrgName,L=City,ST=state,C=US" }
2017-01-27T06:57:09.027Z I NETWORK [conn2] end connection 127.0.0.1:50679 (0 connections now open)
2017-01-27T06:57:09.028Z I NETWORK [initandlisten] connection accepted from 127.0.0.1:50680 #3 (1 connection now open)
2017-01-27T06:57:09.252Z I ACCESS [conn3] authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "emailAddress=email@email.com,CN=searchhead.abc.com,OU=OU,O=OrgName,L=City,ST=state,C=US" }
2017-01-27T06:57:09.254Z I NETWORK [conn3] end connection 127.0.0.1:50680 (0 connections now open)
2017-01-27T06:57:09.255Z I NETWORK [initandlisten] connection accepted from 127.0.0.1:50681 #4 (1 connection now open)
2017-01-27T06:57:09.471Z I ACCESS [conn4] authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "emailAddress=email@email.com,CN=searchhead.abc.com,OU=OU,O=OrgName,L=City,ST=state,C=US" }
2017-01-27T06:57:09.472Z I NETWORK [conn4] end connection 127.0.0.1:50681 (0 connections now open)
2017-01-27T06:57:09.473Z I NETWORK [initandlisten] connection accepted from 127.0.0.1:50683 #5 (1 connection now open)
2017-01-27T06:57:09.693Z I NETWORK [initandlisten] connection accepted from 127.0.0.1:50684 #6 (2 connections now open)
2017-01-27T06:57:09.916Z I ACCESS [conn6] authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "emailAddress=email@email.com,CN=searchhead.abc.com,OU=OU,O=OrgName,L=City,ST=state,C=US" }
2017-01-27T06:57:10.759Z I NETWORK [initandlisten] connection accepted from 127.0.0.1:50689 #7 (3 connections now open)
2017-01-27T06:57:10.974Z I NETWORK [initandlisten] connection accepted from 127.0.0.1:50691 #8 (4 connections now open)
2017-01-27T06:57:11.189Z I ACCESS [conn8] authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "emailAddress=email@email.com,CN=searchhead.abc.com,OU=OU,O=OrgName,L=City,ST=state,C=US" }
2017-01-27T06:57:11.204Z I - [conn6] Assertion: 17322:write to oplog failed: DocTooLargeForCapp
2017-01-27T06:59:34.918Z I - [conn6] Assertion: 17322:write to oplog failed: DocTooLargeForCapped document doesn't fit in capped collection. size: 160 storageSize:209715200 @ 28575
2017-01-27T06:59:34.972Z I CONTROL [conn6] mongod.exe index_collator_extension+0x1465d3
2017-01-27T06:59:34.972Z I CONTROL [conn6] mongod.exe index_collator_extension+0xfd72f
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe index_collator_extension+0xf05ae
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe index_collator_extension+0xf04d3
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe index_collator_extension+0x44fb88
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe index_collator_extension+0x10a343
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe index_collator_extension+0x166c01
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe index_collator_extension+0x47e72b
2017-01-27T06:59:34.973Z I CONTROL [conn6] mongod.exe index_collator_extension+0x47e8d2
2017-01-27T06:59:34.973Z I CONTROL [conn6] KERNEL32.DLL BaseThreadInitThunk+0x22
2017-01-27T06:59:34.973Z I CONTROL [conn6]
2017-01-27T06:59:36.029Z W STORAGE [conn6] couldn't make room for record len: 160 in capped ns local.oplog.rs
numRecords: 566262
Extent 0 (capExtent) 1:2000
magic: 41424344 extent->ns: local.oplog.rs
fr: null lr: 1:c3fbd58 extent->len: 209715200
2017-01-27T06:59:36.029Z I - [conn6] Fatal Assertion 17438
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe index_collator_extension+0x1465d3
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe index_collator_extension+0xfd72f
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe index_collator_extension+0xefec7
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe ???
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe index_collator_extension+0x44fb88
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe index_collator_extension+0x10a343
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe index_collator_extension+0x166c01
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe index_collator_extension+0x47e72b
2017-01-27T06:59:36.080Z I CONTROL [conn6] mongod.exe index_collator_extension+0x47e8d2
2017-01-27T06:59:36.080Z I CONTROL [conn6] KERNEL32.DLL BaseThreadInitThunk+0x22
2017-01-27T06:59:36.080Z I CONTROL [conn6]
2017-01-27T06:59:36.080Z I - [conn6]
***aborting after fassert() failure
It's been condensed a bit, but essentially the ???, index_collator_extension, and assertion lines repeat many times before mongod finally gives up and shuts down. This appears to have started after we replaced our SSL certificates, but we have an ES search head with the same configuration (different certificate) whose KV Store is functioning normally.
We aren't actually using KV Store for anything on this search head yet, and we don't have much experience with it, but I'm not afraid to do a complete re-initialization if that's what it takes to get it functioning again.
Any ideas on what's causing this and how to resolve it?
Increasing [kvstore]/oplogSize in server.conf to 2000 MB (the default is 1000), and cleaning the KV Store while I was at it, resolved this issue.
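For anyone making the same change, the setting lives in server.conf (the stanza and option names are standard; the file path assumes a default Windows install). A splunkd restart is needed for it to take effect, and as far as I can tell the new oplog size only applies once the oplog is recreated, which the clean/resync shown after the doc link below takes care of:

# C:\Program Files\Splunk\etc\system\local\server.conf
[kvstore]
# Size of the MongoDB replication oplog in MB (default: 1000)
oplogSize = 2000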
Once I read that the default was 1000, I took another look at my kvstore directory and saw all of the ".0" files, which I assume are the primary KV Store MongoDB data files. Their combined size was about 1037 MB -- just over the default cap set in server.conf.
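If you want to check the same thing, totaling the ".0" files is a one-liner on Windows (path assumes a default install; the summary line at the bottom of the output gives the combined size):

dir "C:\Program Files\Splunk\var\lib\splunk\kvstore\mongo\*.0"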
I'm pretty confident that just increasing the size would have resolved it on its own, but I wanted to make sure the job was done 🙂
More info:
"Prevent stale members by increasing operations log size"
https://docs.splunk.com/Documentation/Splunk/6.5.1/Admin/ResyncKVstore
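Per that page, the resync on a standalone search head boils down to the following (note that this wipes the local KV Store data, which was fine in our case since nothing was stored in it yet):

splunk stop
splunk clean kvstore --local
splunk start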
Resyncing KV Store resolved our issue. Thank you for including that link.