I am attempting to migrate my KV store to wiredTiger per https://docs.splunk.com/Documentation/Splunk/8.1.1/Admin/MigrateKVstore#Migrate_the_KV_store_after_a...
After running the migrate command, I get this error:
[ansible@splunk splunk]$ sudo ./bin/splunk migrate kvstore-storage-engine --target-engine wiredTiger
Starting KV Store storage engine upgrade:
Phase 1 (dump) of 2:
...............................................................................................
Phase 2 (restore) of 2:
Restoring data back to previous KV Store database
ERROR: Failed to migrate to storage engine wiredTiger, reason=KVStore service will not start because kvstore process terminated
Looking at my mongodb.log file, I see the following:
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] MongoDB starting : pid=4416 port=8191 dbpath=/opt/splunk/var/lib/splunk/kvstore/mongo 64-bit host=splunk
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] db version v3.6.17-linux-splunk-v4
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] git version: 226949cc252af265483afbf859b446590b09b098
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.2za-fips 24 Aug 2021
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] allocator: tcmalloc
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] modules: none
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] build environment:
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] distarch: x86_64
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] target_arch: x86_64
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] 3072 MB of memory available to the process out of 15854 MB total system memory
2021-12-27T00:43:57.647Z I CONTROL [initandlisten] options: { net: { bindIp: "0.0.0.0", port: 8191, ssl: { PEMKeyFile: "/opt/splunk/etc/auth/server.pem", PEMKeyPassword: "<password>", allowInvalidHostnames: true, disabledProtocols: "noTLS1_0,noTLS1_1", mode: "requireSSL", sslCipherConfig: "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RS..." }, unixDomainSocket: { enabled: false } }, replication: { oplogSizeMB: 200 }, security: { javascriptEnabled: false, keyFile: "/opt/splunk/var/lib/splunk/kvstore/mongo/splunk.key" }, setParameter: { enableLocalhostAuthBypass: "0", oplogFetcherSteadyStateMaxFetcherRestarts: "0" }, storage: { dbPath: "/opt/splunk/var/lib/splunk/kvstore/mongo", engine: "mmapv1", mmapv1: { smallFiles: true } }, systemLog: { timeStampFormat: "iso8601-utc" } }
2021-12-27T00:43:57.664Z I JOURNAL [initandlisten] journal dir=/opt/splunk/var/lib/splunk/kvstore/mongo/journal
2021-12-27T00:43:57.664Z I JOURNAL [initandlisten] recover : no journal files present, no recovery needed
2021-12-27T00:43:57.948Z I JOURNAL [durability] Durability thread started
2021-12-27T00:43:57.948Z I JOURNAL [journal writer] Journal writer thread started
2021-12-27T00:43:57.949Z I CONTROL [initandlisten]
2021-12-27T00:43:57.949Z I CONTROL [initandlisten] ** WARNING: No SSL certificate validation can be performed since no CA file has been provided
2021-12-27T00:43:57.949Z I CONTROL [initandlisten] ** Please specify an sslCAFile parameter.
2021-12-27T00:43:57.949Z I CONTROL [initandlisten] ** WARNING: You are running this process as the root user, which is not recommended.
2021-12-27T00:43:57.949Z I CONTROL [initandlisten]
2021-12-27T00:43:58.069Z I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/opt/splunk/var/lib/splunk/kvstore/mongo/diagnostic.data'
2021-12-27T00:43:58.100Z I STORAGE [initandlisten]
2021-12-27T00:43:58.100Z I STORAGE [initandlisten] ** WARNING: mongod started without --replSet yet 1 documents are present in local.system.replset
2021-12-27T00:43:58.100Z I STORAGE [initandlisten] ** Restart with --replSet unless you are doing maintenance and no other clients are connected.
2021-12-27T00:43:58.100Z I STORAGE [initandlisten] ** The TTL collection monitor will not start because of this.
2021-12-27T00:43:58.100Z I STORAGE [initandlisten] **
2021-12-27T00:43:58.100Z I STORAGE [initandlisten] For more info see http://dochub.mongodb.org/core/ttlcollections
2021-12-27T00:43:58.100Z I STORAGE [initandlisten]
2021-12-27T00:43:58.101Z I NETWORK [initandlisten] listening via socket bound to 0.0.0.0
2021-12-27T00:43:58.101Z I NETWORK [initandlisten] waiting for connections on port 8191 ssl
2021-12-27T00:43:58.575Z I NETWORK [listener] connection accepted from 127.0.0.1:51402 #1 (1 connection now open)
2021-12-27T00:43:58.582Z I NETWORK [conn1] received client metadata from 127.0.0.1:51402 conn1: { driver: { name: "mongoc", version: "1.16.2" }, os: { type: "Linux", name: "Red Hat Enterprise Linux", version: "8.5", architecture: "x86_64" }, platform: "cfg=0x00001620c9 posix=200112 stdc=201710 CC=GCC 9.1.0 CFLAGS="-g -fstack-protector-strong -static-libgcc -L/opt/splunk-home/lib/static-libstdc" LDFLA..." }
2021-12-27T00:43:58.599Z I ACCESS [conn1] Successfully authenticated as principal __system on local from client 127.0.0.1:51402
2021-12-27T00:43:58.599Z I NETWORK [conn1] end connection 127.0.0.1:51402 (0 connections now open)
mongodump 2021-12-26T17:43:59.640-0700 WARNING: --sslAllowInvalidCertificates and --sslAllowInvalidHostnames are deprecated, please use --tlsInsecure instead
2021-12-27T00:43:59.652Z I NETWORK [listener] connection accepted from 127.0.0.1:51404 #2 (1 connection now open)
2021-12-27T00:43:59.728Z I ACCESS [conn2] Successfully authenticated as principal __system on local from client 127.0.0.1:51404
2021-12-27T00:43:59.750Z I NETWORK [listener] connection accepted from 127.0.0.1:51406 #3 (2 connections now open)
2021-12-27T00:43:59.805Z I ACCESS [conn3] Successfully authenticated as principal __system on local from client 127.0.0.1:51406
mongodump 2021-12-26T17:44:00.073-0700 writing admin.system.indexes to
mongodump 2021-12-26T17:44:00.075-0700 done dumping admin.system.indexes (2 documents)
mongodump 2021-12-26T17:44:00.075-0700 writing config.system.indexes to
mongodump 2021-12-26T17:44:00.077-0700 done dumping config.system.indexes (3 documents)
mongodump 2021-12-26T17:44:00.077-0700 writing admin.system.version to
mongodump 2021-12-26T17:44:00.079-0700 done dumping admin.system.version (1 document)
... a whole bunch of other dumps completing...
mongodump 2021-12-26T17:44:00.635-0700 done dumping s_Splunk5+n+0jIfNWH9x+qdy7cD4GTT_sse_jse2D8rEiNk5kfRO1HbJ@VAjMp.c (10 documents)
2021-12-27T00:44:00.635Z I NETWORK [conn2] end connection 127.0.0.1:51404 (3 connections now open)
2021-12-27T00:44:00.635Z I NETWORK [conn3] end connection 127.0.0.1:51406 (2 connections now open)
2021-12-27T00:44:00.636Z I NETWORK [conn5] end connection 127.0.0.1:51410 (1 connection now open)
2021-12-27T00:44:00.636Z I NETWORK [conn4] end connection 127.0.0.1:51408 (0 connections now open)
2021-12-27T00:44:00.671Z I NETWORK [listener] connection accepted from 127.0.0.1:51412 #6 (1 connection now open)
2021-12-27T00:44:00.676Z I NETWORK [conn6] received client metadata from 127.0.0.1:51412 conn6: { driver: { name: "mongoc", version: "1.16.2" }, os: { type: "Linux", name: "Red Hat Enterprise Linux", version: "8.5", architecture: "x86_64" }, platform: "cfg=0x00001620c9 posix=200112 stdc=201710 CC=GCC 9.1.0 CFLAGS="-g -fstack-protector-strong -static-libgcc -L/opt/splunk-home/lib/static-libstdc" LDFLA..." }
2021-12-27T00:44:00.676Z I NETWORK [listener] connection accepted from 127.0.0.1:51414 #7 (2 connections now open)
2021-12-27T00:44:00.682Z I NETWORK [conn7] received client metadata from 127.0.0.1:51414 conn7: { driver: { name: "mongoc", version: "1.16.2" }, os: { type: "Linux", name: "Red Hat Enterprise Linux", version: "8.5", architecture: "x86_64" }, platform: "cfg=0x00001620c9 posix=200112 stdc=201710 CC=GCC 9.1.0 CFLAGS="-g -fstack-protector-strong -static-libgcc -L/opt/splunk-home/lib/static-libstdc" LDFLA..." }
2021-12-27T00:44:00.699Z I ACCESS [conn7] Successfully authenticated as principal __system on local from client 127.0.0.1:51414
2021-12-27T00:44:00.723Z I CONTROL [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2021-12-27T00:44:00.724Z I NETWORK [signalProcessingThread] shutdown: going to close listening sockets...
2021-12-27T00:44:00.724Z I FTDC [signalProcessingThread] Shutting down full-time diagnostic data capture
2021-12-27T00:44:00.726Z I STORAGE [signalProcessingThread] shutdown: waiting for fs preallocator...
2021-12-27T00:44:00.726Z I STORAGE [signalProcessingThread] shutdown: final commit...
2021-12-27T00:44:00.729Z I JOURNAL [signalProcessingThread] journalCleanup...
2021-12-27T00:44:00.729Z I JOURNAL [signalProcessingThread] removeJournalFiles
2021-12-27T00:44:00.729Z I JOURNAL [signalProcessingThread] old journal file will be removed: /opt/splunk/var/lib/splunk/kvstore/mongo/journal/j._0
2021-12-27T00:44:00.730Z I JOURNAL [signalProcessingThread] Terminating durability thread ...
2021-12-27T00:44:00.828Z I JOURNAL [journal writer] Journal writer thread stopped
2021-12-27T00:44:00.828Z I JOURNAL [durability] Durability thread stopped
2021-12-27T00:44:00.828Z I STORAGE [signalProcessingThread] shutdown: closing all files...
2021-12-27T00:44:00.855Z I STORAGE [signalProcessingThread] closeAllFiles() finished
2021-12-27T00:44:00.855Z I STORAGE [signalProcessingThread] shutdown: removing fs lock...
2021-12-27T00:44:00.855Z I CONTROL [signalProcessingThread] now exiting
2021-12-27T00:44:00.856Z I CONTROL [signalProcessingThread] shutting down with code:0
I've seen some other errors reported with this process, but they all seem to be related to file permission issues. My file permissions look OK, and given that the dump of the existing data completes, permissions don't seem to be the problem anyway. Any other ideas about what is wrong here?
I've fixed my issue, but I still don't understand what was wrong. I spent several hours troubleshooting and here is what I did/found:
I created a new container with the same settings, copied my existing persistent $SPLUNK_HOME/var and $SPLUNK_HOME/etc data to new folders, and pointed the new container at those. This gave me a replica of my production instance I could experiment with. I went through the same migration attempt in this new container and ran into the same problem. I then started stripping customization out of the container: I removed the persistent $SPLUNK_HOME/etc mapped volume, reset the container, and so on. I got all the way down to a completely stock Splunk container with only $SPLUNK_HOME/var/lib/splunk/kvstore/mongo mapped to my persistent data, and the migration still failed. If I instead did the inverse and mapped everything except $SPLUNK_HOME/var/lib/splunk/kvstore/mongo (effectively letting Splunk create a new "mongo" folder and KV store database on its next startup), the migration worked fine. So this seemed to indicate the issue was with my database files. I then went back to my original setup and ran ./splunk clean kvstore --local, which cleared out the entire "mongo" directory (confirmed with an ls of the directory). I restarted Splunk, it created a new KV store, and I tried the migration again... and again it failed. I don't understand this at all: the issue was narrowed down to my mongo files, but after deleting those files and letting the KV store rebuild from scratch, the migration still refused to work.
Anyhow, I ended up using the "KV Store Tools" app to back up my KV store collections from my prod instance to files, then stopped my prod Splunk instance and deleted the "mongo" folder. I then fired up a brand new, completely unmodified Splunk container, performed a migration of its default KV store, and scp'ed the resulting "mongo" directory off that container to my prod instance. Back on my prod instance, I confirmed all the permissions on the new "mongo" folder were correct, added the storageEngine=wiredTiger line to my [kvstore] stanza in server.conf, and started my prod Splunk instance back up. It fired right up and appeared to be working (no KV store errors reported), and splunk show kvstore-status reports mongod is running with wiredTiger. Lastly, I used the "KV Store Tools" app to restore my KV store collections from the backup files.
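For reference, the server.conf change mentioned above is a single line in the [kvstore] stanza. The file path shown is the usual local override location, not necessarily where every install keeps it:

```
# $SPLUNK_HOME/etc/system/local/server.conf
[kvstore]
storageEngine = wiredTiger
```

Splunk reads this on startup; without the line, the KV store falls back to whatever engine the data directory was created with.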
So... my prod instance is now on wiredTiger, but I have no idea what the original problem was. If anyone has an idea, I'd love to hear it.
Hi @rtadams89
ERROR: Failed to migrate to storage engine wiredTiger, reason=KVStore service will not start because kvstore process terminated
The "KVStore service will not start" error is a complex one, and lengthy troubleshooting may be required.
1. To begin with, let's check the file permissions of:
ls -l /opt/splunk/var/lib/splunk/kvstore/mongo/splunk.key
2. Please provide some details of your SSL certificates: are you using the default ones or third-party certs? Thanks.
Permissions on splunk.key are as expected:
-r-------- 1 splunk splunk 88 Oct 29 2020 /opt/splunk/var/lib/splunk/kvstore/mongo/splunk.key
Splunk web is using https with my own cert. All other certs are default/unchanged.
As you are running this through Ansible, may I ask whether you have a simple test/dev system? If so, please try replicating this task on the test/dev system.
I have worked on wiredTiger migrations before and they worked like a charm.
The "kvstore process terminated" failure can be a troublesome issue at times.
I'm running the Splunk Docker image (https://registry.hub.docker.com/r/splunk/splunk/) latest tag.
I just deployed a new container with all the default settings (other than setting a Splunk password, as required) and attempted the wiredTiger migration. It worked exactly as documented, so there has to be something unique to my production setup causing the issue. I have $SPLUNK_HOME/etc and $SPLUNK_HOME/var mapped to persistent volumes, but the production Docker image itself has been reset to defaults. So either something is broken with the mongodb files in $SPLUNK_HOME/var or I have a configuration file that Splunk is unhappy with.
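For context, the volume mapping described above looks roughly like this docker-compose sketch. This is illustrative only: the host paths, password, and service name are placeholders, not my actual values:

```yaml
# Illustrative sketch of the persistent-volume setup; all values are placeholders.
version: "3"
services:
  splunk:
    image: splunk/splunk:latest
    environment:
      SPLUNK_START_ARGS: "--accept-license"
      SPLUNK_PASSWORD: "changeme"
    ports:
      - "8000:8000"
    volumes:
      # Persisting both trees is what distinguishes my prod setup
      # from the stock container where the migration succeeds.
      - /data/splunk/etc:/opt/splunk/etc
      - /data/splunk/var:/opt/splunk/var
```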
The Splunk Docker GitHub page is this one:
https://github.com/splunk/docker-splunk
I would suggest raising this on the GitHub issue tracker, where the developers can see your issue and reply to you directly.
Though there are a lot of Docker Splunk admins here, the original developers may not be active in Splunk Community. Hope you got my point, thanks.
I don't think it is an issue with the Docker image, as my test shows the migration to wiredTiger works fine with the default image. It isn't until I mount my persistent $SPLUNK_HOME/etc and $SPLUNK_HOME/var directories that the migration fails. It has to be something with either my mongodb files or my config files.