I am attempting to migrate my KV store to wiredTiger per https://docs.splunk.com/Documentation/Splunk/8.1.1/Admin/MigrateKVstore#Migrate_the_KV_store_after_a...
After running the migrate command, I get this error:
[ansible@splunk splunk]$ sudo ./bin/splunk migrate kvstore-storage-engine --target-engine wiredTiger
Starting KV Store storage engine upgrade:
Phase 1 (dump) of 2:
...............................................................................................
Phase 2 (restore) of 2:
Restoring data back to previous KV Store database
ERROR: Failed to migrate to storage engine wiredTiger, reason=KVStore service will not start because kvstore process terminated
Looking at my mongodb.log file, I see the following:
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] MongoDB starting : pid=4416 port=8191 dbpath=/opt/splunk/var/lib/splunk/kvstore/mongo 64-bit host=splunk
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] db version v3.6.17-linux-splunk-v4
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] git version: 226949cc252af265483afbf859b446590b09b098
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.2za-fips 24 Aug 2021
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] allocator: tcmalloc
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] modules: none
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] build environment:
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] distarch: x86_64
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] target_arch: x86_64
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] 3072 MB of memory available to the process out of 15854 MB total system memory
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] options: { net: { bindIp: "0.0.0.0", port: 8191, ssl: { PEMKeyFile: "/opt/splunk/etc/auth/server.pem", PEMKeyPassword: "<password>", allowInvalidHostnames: true, disabledProtocols: "noTLS1_0,noTLS1_1", mode: "requireSSL", sslCipherConfig: "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RS..." }, unixDomainSocket: { enabled: false } }, replication: { oplogSizeMB: 200 }, security: { javascriptEnabled: false, keyFile: "/opt/splunk/var/lib/splunk/kvstore/mongo/splunk.key" }, setParameter: { enableLocalhostAuthBypass: "0", oplogFetcherSteadyStateMaxFetcherRestarts: "0" }, storage: { dbPath: "/opt/splunk/var/lib/splunk/kvstore/mongo", engine: "mmapv1", mmapv1: { smallFiles: true } }, systemLog: { timeStampFormat: "iso8601-utc" } }
2021-12-27T00:43:57.664Z I JOURNAL [initandlisten] journal dir=/opt/splunk/var/lib/splunk/kvstore/mongo/journal
2021-12-27T00:43:57.664Z I JOURNAL [initandlisten] recover : no journal files present, no recovery needed
2021-12-27T00:43:57.948Z I JOURNAL [durability] Durability thread started
2021-12-27T00:43:57.948Z I JOURNAL [journal writer] Journal writer thread started
2021-12-27T00:43:57.949Z I CONTROL [initandlisten]
2021-12-27T00:43:57.949Z I CONTROL [initandlisten] ** WARNING: No SSL certificate validation can be performed since no CA file has been provided
2021-12-27T00:43:57.949Z I CONTROL [initandlisten] ** Please specify an sslCAFile parameter.
2021-12-27T00:43:57.949Z I CONTROL [initandlisten] ** WARNING: You are running this process as the root user, which is not recommended.
2021-12-27T00:43:57.949Z I CONTROL [initandlisten]
2021-12-27T00:43:58.069Z I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/opt/splunk/var/lib/splunk/kvstore/mongo/diagnostic.data'
2021-12-27T00:43:58.100Z I STORAGE [initandlisten]
2021-12-27T00:43:58.100Z I STORAGE [initandlisten] ** WARNING: mongod started without --replSet yet 1 documents are present in local.system.replset
2021-12-27T00:43:58.100Z I STORAGE [initandlisten] ** Restart with --replSet unless you are doing maintenance and no other clients are connected.
2021-12-27T00:43:58.100Z I STORAGE [initandlisten] ** The TTL collection monitor will not start because of this.
2021-12-27T00:43:58.100Z I STORAGE [initandlisten] **
2021-12-27T00:43:58.100Z I STORAGE [initandlisten] For more info see http://dochub.mongodb.org/core/ttlcollections
2021-12-27T00:43:58.100Z I STORAGE [initandlisten]
2021-12-27T00:43:58.101Z I NETWORK [initandlisten] listening via socket bound to 0.0.0.0
2021-12-27T00:43:58.101Z I NETWORK [initandlisten] waiting for connections on port 8191 ssl
2021-12-27T00:43:58.575Z I NETWORK [listener] connection accepted from 127.0.0.1:51402 #1 (1 connection now open)
2021-12-27T00:43:58.582Z I NETWORK [conn1] received client metadata from 127.0.0.1:51402 conn1: { driver: { name: "mongoc", version: "1.16.2" }, os: { type: "Linux", name: "Red Hat Enterprise Linux", version: "8.5", architecture: "x86_64" }, platform: "cfg=0x00001620c9 posix=200112 stdc=201710 CC=GCC 9.1.0 CFLAGS="-g -fstack-protector-strong -static-libgcc -L/opt/splunk-home/lib/static-libstdc" LDFLA..." }
2021-12-27T00:43:58.599Z I ACCESS [conn1] Successfully authenticated as principal __system on local from client 127.0.0.1:51402
2021-12-27T00:43:58.599Z I NETWORK [conn1] end connection 127.0.0.1:51402 (0 connections now open)
mongodump 2021-12-26T17:43:59.640-0700 WARNING: --sslAllowInvalidCertificates and --sslAllowInvalidHostnames are deprecated, please use --tlsInsecure instead
2021-12-27T00:43:59.652Z I NETWORK [listener] connection accepted from 127.0.0.1:51404 #2 (1 connection now open)
2021-12-27T00:43:59.728Z I ACCESS [conn2] Successfully authenticated as principal __system on local from client 127.0.0.1:51404
2021-12-27T00:43:59.750Z I NETWORK [listener] connection accepted from 127.0.0.1:51406 #3 (2 connections now open)
2021-12-27T00:43:59.805Z I ACCESS [conn3] Successfully authenticated as principal __system on local from client 127.0.0.1:51406
mongodump 2021-12-26T17:44:00.073-0700 writing admin.system.indexes to
mongodump 2021-12-26T17:44:00.075-0700 done dumping admin.system.indexes (2 documents)
mongodump 2021-12-26T17:44:00.075-0700 writing config.system.indexes to
mongodump 2021-12-26T17:44:00.077-0700 done dumping config.system.indexes (3 documents)
mongodump 2021-12-26T17:44:00.077-0700 writing admin.system.version to
mongodump 2021-12-26T17:44:00.079-0700 done dumping admin.system.version (1 document)
... a whole bunch of other dumps completing...
mongodump 2021-12-26T17:44:00.635-0700 done dumping s_Splunk5+n+0jIfNWH9x+qdy7cD4GTT_sse_jse2D8rEiNk5kfRO1HbJ@VAjMp.c (10 documents)
2021-12-27T00:44:00.635Z I NETWORK [conn2] end connection 127.0.0.1:51404 (3 connections now open)
2021-12-27T00:44:00.635Z I NETWORK [conn3] end connection 127.0.0.1:51406 (2 connections now open)
2021-12-27T00:44:00.636Z I NETWORK [conn5] end connection 127.0.0.1:51410 (1 connection now open)
2021-12-27T00:44:00.636Z I NETWORK [conn4] end connection 127.0.0.1:51408 (0 connections now open)
2021-12-27T00:44:00.671Z I NETWORK [listener] connection accepted from 127.0.0.1:51412 #6 (1 connection now open)
2021-12-27T00:44:00.676Z I NETWORK [conn6] received client metadata from 127.0.0.1:51412 conn6: { driver: { name: "mongoc", version: "1.16.2" }, os: { type: "Linux", name: "Red Hat Enterprise Linux", version: "8.5", architecture: "x86_64" }, platform: "cfg=0x00001620c9 posix=200112 stdc=201710 CC=GCC 9.1.0 CFLAGS="-g -fstack-protector-strong -static-libgcc -L/opt/splunk-home/lib/static-libstdc" LDFLA..." }
2021-12-27T00:44:00.676Z I NETWORK [listener] connection accepted from 127.0.0.1:51414 #7 (2 connections now open)
2021-12-27T00:44:00.682Z I NETWORK [conn7] received client metadata from 127.0.0.1:51414 conn7: { driver: { name: "mongoc", version: "1.16.2" }, os: { type: "Linux", name: "Red Hat Enterprise Linux", version: "8.5", architecture: "x86_64" }, platform: "cfg=0x00001620c9 posix=200112 stdc=201710 CC=GCC 9.1.0 CFLAGS="-g -fstack-protector-strong -static-libgcc -L/opt/splunk-home/lib/static-libstdc" LDFLA..." }
2021-12-27T00:44:00.699Z I ACCESS [conn7] Successfully authenticated as principal __system on local from client 127.0.0.1:51414
2021-12-27T00:44:00.723Z I CONTROL [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2021-12-27T00:44:00.724Z I NETWORK [signalProcessingThread] shutdown: going to close listening sockets...
2021-12-27T00:44:00.724Z I FTDC [signalProcessingThread] Shutting down full-time diagnostic data capture
2021-12-27T00:44:00.726Z I STORAGE [signalProcessingThread] shutdown: waiting for fs preallocator...
2021-12-27T00:44:00.726Z I STORAGE [signalProcessingThread] shutdown: final commit...
2021-12-27T00:44:00.729Z I JOURNAL [signalProcessingThread] journalCleanup...
2021-12-27T00:44:00.729Z I JOURNAL [signalProcessingThread] removeJournalFiles
2021-12-27T00:44:00.729Z I JOURNAL [signalProcessingThread] old journal file will be removed: /opt/splunk/var/lib/splunk/kvstore/mongo/journal/j._0
2021-12-27T00:44:00.730Z I JOURNAL [signalProcessingThread] Terminating durability thread ...
2021-12-27T00:44:00.828Z I JOURNAL [journal writer] Journal writer thread stopped
2021-12-27T00:44:00.828Z I JOURNAL [durability] Durability thread stopped
2021-12-27T00:44:00.828Z I STORAGE [signalProcessingThread] shutdown: closing all files...
2021-12-27T00:44:00.855Z I STORAGE [signalProcessingThread] closeAllFiles() finished
2021-12-27T00:44:00.855Z I STORAGE [signalProcessingThread] shutdown: removing fs lock...
2021-12-27T00:44:00.855Z I CONTROL [signalProcessingThread] now exiting
2021-12-27T00:44:00.856Z I CONTROL [signalProcessingThread] shutting down with code:0
I've seen some other errors reported with this process, but they all seem to be related to file permission issues. My file permissions look OK, and given that the dump of the existing data completes, permissions don't seem to be the problem anyway. Any other ideas about what is wrong here?
I've fixed my issue, but I still don't understand what was wrong. I spent several hours troubleshooting and here is what I did/found:
I created a new container with the same settings, copied my existing persistent $SPLUNK_HOME/var and $SPLUNK_HOME/etc data to new folders, and pointed the new container at those. This gave me a replica of my production instance I could experiment with. I went through the same migration attempt in this new container and ran into the same problem. I then started stripping customization out of the container: I removed the persistent $SPLUNK_HOME/etc mapped volume, reset the container, and so on. I got all the way down to a completely stock Splunk container with only $SPLUNK_HOME/var/lib/splunk/kvstore/mongo mapped to my persistent data, and the migration still failed. If I instead did the inverse and mapped everything except $SPLUNK_HOME/var/lib/splunk/kvstore/mongo (effectively letting Splunk create a new "mongo" folder and KV store database on its next startup), the migration worked fine. So this seemed to indicate the issue was with my database files. I then went back to my original setup and ran ./splunk clean kvstore --local, which cleared out the entire "mongo" directory (confirmed with an ls of the directory). I restarted Splunk, it created a new KV store, and I tried the migration again... and again it failed. I don't understand this at all: the issue was narrowed down to my mongo files, but after deleting those files and letting the KV store rebuild from scratch, the migration still refused to work.
Anyhow, I ended up using the "KV Store Tools" app to back up my KV store collections from my prod instance to files, then stopped my prod Splunk instance and deleted the "mongo" folder. I then fired up a brand new, completely unmodified Splunk container, performed a migration of its default KV store, and scp'ed the resulting "mongo" directory off that container to my prod instance. Back on my prod instance, I confirmed all the permissions on the new "mongo" folder were correct, added the storageEngine=wiredTiger line to my [kvstore] stanza in server.conf, and started my prod Splunk instance back up. It fired right up and appeared to be working (no KV store errors reported), and splunk show kvstore-status reports mongod is running with wiredTiger. Lastly, I used the "KV Store Tools" app to restore my KV store collections from the backup files.
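For reference, the server.conf change mentioned above is a single line in the [kvstore] stanza. The file path shown is the usual local override location, not necessarily where every install keeps it:

```
# $SPLUNK_HOME/etc/system/local/server.conf
[kvstore]
storageEngine = wiredTiger
```

Splunk reads this on startup; without the line, the KV store falls back to whatever engine the data directory was created with.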
So... my prod instance is now on wiredTiger, but I have no idea what the original problem was. If anyone has an idea, I'd love to hear it.
Hi @rtadams89
ERROR: Failed to migrate to storage engine wiredTiger, reason=KVStore service will not start because kvstore process terminated
The "KVStore service will not start" error is a complex one, and lengthy troubleshooting may be required.
1. To begin with, let's check the file permissions of:
ls -l /opt/splunk/var/lib/splunk/kvstore/mongo/splunk.key
2. Please provide some details of your SSL certificates: are you using the default ones or third-party certs? Thanks.
Permissions on splunk.key are as expected:
-r-------- 1 splunk splunk 88 Oct 29 2020 /opt/splunk/var/lib/splunk/kvstore/mongo/splunk.key
Splunk web is using https with my own cert. All other certs are default/unchanged.
As you are running this through Ansible, may I ask whether you have a simple test/dev system? If so, please try replicating this task on the test/dev system.
I have worked on wiredTiger migrations before and they worked like a charm.
The "kvstore process terminated" failure can be a troublesome issue at times.
I'm running the Splunk Docker image (https://registry.hub.docker.com/r/splunk/splunk/) latest tag.
I just deployed a new container with all the default settings (other than setting a Splunk password, as required) and attempted the wiredTiger migration. It worked exactly as documented, so there has to be something unique to my production setup causing the issue. I have $SPLUNK_HOME/etc and $SPLUNK_HOME/var mapped to persistent volumes, but the production Docker image itself has been reset to defaults. So either something is broken with the mongodb files in $SPLUNK_HOME/var or I have a configuration file that Splunk is unhappy with.
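For context, the volume mapping described above looks roughly like this docker-compose sketch. This is illustrative only: the host paths, password, and service name are placeholders, not my actual values:

```yaml
# Illustrative sketch of the persistent-volume setup; all values are placeholders.
version: "3"
services:
  splunk:
    image: splunk/splunk:latest
    environment:
      SPLUNK_START_ARGS: "--accept-license"
      SPLUNK_PASSWORD: "changeme"
    ports:
      - "8000:8000"
    volumes:
      # Persisting both trees is what distinguishes my prod setup
      # from the stock container where the migration succeeds.
      - /data/splunk/etc:/opt/splunk/etc
      - /data/splunk/var:/opt/splunk/var
```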
The Splunk Docker GitHub page is this one:
https://github.com/splunk/docker-splunk
I would suggest raising this on the GitHub issue tracker, where the developers can see your issue and reply to you directly.
Though there are a lot of Docker Splunk admins here, the original developers may not be active in Splunk Community. Hope you got my point, thanks.
I don't think it is an issue with the Docker image, as my test shows the migration to wiredTiger works fine with the default image. It isn't until I mount my persistent $SPLUNK_HOME/etc and $SPLUNK_HOME/var directories that the migration fails. It has to be something with either my mongodb files or my config files.