Knowledge Management

In upgrading Splunk Enterprise from 7.1.3 to 7.2.0, why is the Mongo Migration failing?

Path Finder

Hello,

I am trying the Splunk Enterprise 7.1.3 to 7.2.0 upgrade in my test environment, and I am currently stuck on the Search Cluster upgrade. I first attempted to upgrade one node at a time, which failed, and then took the entire search cluster offline to do the upgrade. Now I cannot get the Splunk service to start back up, and I am getting the error message shown below.

The Search Deployer had a similar issue with the upgrade, but it was resolved with a simple reboot of the instance. I tried the same thing on the search node, along with killing one leftover mongodb process, but neither helped.

I also attempted to run the 'splunk migrate migrate-kvstore' command based on other Splunk Answers posts, which also failed for the same reason.

It seems that the Splunk default certificates are being used. If certificate validation is turned on using the default certificates (not-recommended), this may result in loss of communication in mixed-version Splunk environments after upgrade. 

"/opt/splunk/etc/auth/ca.pem": already a renewed Splunk certificate: skipping renewal
"/opt/splunk/etc/auth/cacert.pem": already a renewed Splunk certificate: skipping renewal
Clustering migration already complete, no further changes required.

Generating checksums for datamodel and report acceleration bucket summaries for all indexes.
If you have defined many indexes and summaries, summary checksum generation may take a long time.
Processed 2 out of 22 configured indexes.
Processed 4 out of 22 configured indexes.
Processed 6 out of 22 configured indexes.
Processed 8 out of 22 configured indexes.
Processed 10 out of 22 configured indexes.
Processed 12 out of 22 configured indexes.
Processed 14 out of 22 configured indexes.
Processed 16 out of 22 configured indexes.
Processed 18 out of 22 configured indexes.
Processed 20 out of 22 configured indexes.
Processed 22 out of 22 configured indexes.
Finished generating checksums for datamodel and report acceleration bucket summaries for all indexes.
ERROR: Failed to migrate mongo feature compatibility version:
ERROR while running migrate-kvstore migration.

I looked in the splunkd.log and mongod.log files, but no new events have been created since I shut down the service prior to starting the 'rpm' upgrade. They both end with the related shutdown events as shown below.

[root@ip-10-2-31-134 ~]# tail -n 10 /opt/splunk/var/log/splunk/splunkd.log
10-11-2018 18:48:34.583 +0000 INFO  ShutdownHandler - shutting down level "ShutdownLevel_Queue"
10-11-2018 18:48:34.583 +0000 INFO  ShutdownHandler - shutting down level "ShutdownLevel_CallbackRunner"
10-11-2018 18:48:34.583 +0000 INFO  ShutdownHandler - shutting down level "ShutdownLevel_HttpClient"
10-11-2018 18:48:34.583 +0000 INFO  ShutdownHandler - shutting down level "ShutdownLevel_DmcProxyHttpClient"
10-11-2018 18:48:34.583 +0000 INFO  ShutdownHandler - shutting down level "ShutdownLevel_Duo2FAHttpClient"
10-11-2018 18:48:34.583 +0000 INFO  ShutdownHandler - shutting down level "ShutdownLevel_ApplicationLicenseChecker"
10-11-2018 18:48:34.583 +0000 INFO  ShutdownHandler - shutting down level "ShutdownLevel_S3ConnectionPoolManager"
10-11-2018 18:48:34.583 +0000 INFO  ShutdownHandler - shutting down level "ShutdownLevel_TelemetryMetricBuffer"
10-11-2018 18:48:34.583 +0000 INFO  ShutdownHandler - Shutdown complete in 36.05 seconds
10-11-2018 18:48:35.581 +0000 INFO  loader - All pipelines finished.

[root@ip-10-2-31-134 ~]# tail -n 10 /opt/splunk/var/log/splunk/mongod.log
2018-10-11T18:48:02.886Z I JOURNAL  [signalProcessingThread] old journal file /opt/splunk/var/lib/splunk/kvstore/mongo/journal/j._0 will be reused as /opt/splunk/var/lib/splunk/kvstore/mongo/journal/prealloc.0
2018-10-11T18:48:02.887Z I JOURNAL  [signalProcessingThread] Terminating durability thread ...
2018-10-11T18:48:02.986Z I JOURNAL  [journal writer] Journal writer thread stopped
2018-10-11T18:48:02.986Z I JOURNAL  [durability] Durability thread stopped
2018-10-11T18:48:02.986Z I STORAGE  [signalProcessingThread] shutdown: closing all files...
2018-10-11T18:48:02.986Z I STORAGE  [signalProcessingThread] closeAllFiles() finished
2018-10-11T18:48:02.986Z I STORAGE  [signalProcessingThread] shutdown: removing fs lock...
2018-10-11T18:48:02.986Z I CONTROL  [signalProcessingThread] now exiting
2018-10-11T18:48:02.986Z I CONTROL  [signalProcessingThread] shutting down with code:0
2018-10-11T18:48:02.986Z I CONTROL  [initandlisten] shutting down with code:0

Thanks,
Erik

1 Solution

Path Finder

Success...for my issues!!!

I believe I have solved all of the problems I was seeing (the upgrade/migration failures, the kvstore not starting, and the SSL error message) by removing the double quotes around the “tls1.2” value for the sslVersions setting in a custom app we deploy to our instances. I am still working through the upgrades of the other instances in the environment to confirm.

[root@ip-10-2-29-7 tmp]# /opt/splunk/bin/splunk cmd btool server list --debug | grep tls
/opt/splunk/etc/system/default/server.conf sslVersions = tls1.2
/opt/splunk/etc/system/default/server.conf sslVersions = tls1.2
/opt/splunk/etc/apps/aws-poc-test-us-east-1-infrastructure-outputs/local/server.conf sslVersions = tls1.2
/opt/splunk/etc/system/default/server.conf sslVersionsForClient = tls1.2

Let me walk through the steps I did to finally get the Cluster Master successfully upgraded to v7.2.0, which are the same steps I am going to work through on the other Splunk Enterprise instances within our SplunkPOC environment.

  1. Commented out the app's sslVersions line shown in the btool output above
  2. Ran the upgrade/migration and started Splunk, with no issues during the migration steps
  3. Confirmed v7.2.0 was running and the mongod service was running
  4. Confirmed “./splunk show kvstore-status” reported “ready”
  5. Uncommented the line and restarted Splunk
  6. Confirmed everything started up
  7. Found that “./splunk show kvstore-status” was stuck at “starting”
  8. Removed the double quotes and restarted Splunk
  9. Confirmed everything started up
  10. Found that “./splunk show kvstore-status” reported “ready”
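The quote fix itself can be sketched as a one-line sed. This demo runs against a scratch file so it is safe to try anywhere; on a real instance the target would be the app's local/server.conf from the btool output (e.g. /opt/splunk/etc/apps/aws-poc-test-us-east-1-infrastructure-outputs/local/server.conf):

```shell
# Demo of the quote fix on a scratch copy of server.conf (not the live file).
conf=$(mktemp)
printf '[sslConfig]\nsslVersions = "tls1.2"\n' > "$conf"
# The quoted value breaks the kvstore migration; strip the double quotes
# so the bare value is used.
sed -i 's/^sslVersions = "tls1.2"$/sslVersions = tls1.2/' "$conf"
grep '^sslVersions' "$conf"   # prints: sslVersions = tls1.2
rm -f "$conf"
```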

Below are the second and third outputs from the “./splunk show kvstore-status” command as outlined above; the first one scrolled off the terminal screen before I could capture it.

[root@ip-10-2-29-7 tmp]# /opt/splunk/bin/splunk show kvstore-status
Your session is invalid. Please login.
Splunk username: admin
Password:
This member:
backupRestoreStatus : Ready
disabled : 0
guid : 386AC707-E7CA-4827-9E6A-2116283D9727
port : 8191
standalone : 1
status : starting

[root@ip-10-2-29-7 tmp]# /opt/splunk/bin/splunk show kvstore-status
Your session is invalid. Please login.
Splunk username: admin
Password:
This member:
backupRestoreStatus : Ready
date : Fri Nov 2 19:42:38 2018
dateSec : 1541187758.074
disabled : 0
guid : 386AC707-E7CA-4827-9E6A-2116283D9727
oplogEndTimestamp : Fri Nov 2 19:42:37 2018
oplogEndTimestampSec : 1541187757
oplogStartTimestamp : Wed Aug 29 22:16:47 2018
oplogStartTimestampSec : 1535581007
port : 8191
replicaSet : 386AC707-E7CA-4827-9E6A-2116283D9727
replicationStatus : KV store captain
standalone : 1
status : ready

KV store members:
127.0.0.1:8191
configVersion : 1
electionDate : Fri Nov 2 19:42:26 2018
electionDateSec : 1541187746
hostAndPort : 127.0.0.1:8191
optimeDate : Fri Nov 2 19:42:37 2018
optimeDateSec : 1541187757
replicationStatus : KV store captain
uptime : 13

I am working through upgrading the rest of our SplunkPOC environment, upgrading the Search Cluster first, then the Indexing Cluster, followed by the other parts, using the process outlined above. I am going to start by just removing the double quotes around the “tls1.2” value for the sslVersions setting and seeing if the upgrade/migration completes without any issues. If the upgrade/migration still fails, I will complete the entire process as outlined above.


Explorer

I couldn't figure out what the problem was with my local config and none of the other suggestions worked, but I managed to get the migration to complete by removing my local configuration completely, e.g.

cd /opt/splunk/etc/system/local
mkdir foo
mv * foo                        # mv refuses to move foo into itself; the rest moves
/opt/splunk/bin/splunk start    # migration now runs against the default config
# complete upgrade successfully
/opt/splunk/bin/splunk start
mv -f foo/* .                   # restore the local configuration
rmdir foo
/opt/splunk/bin/splunk start    # restart with the original config in place

Explorer

We had a similar issue that we were able to fix, but nothing anyone posted here was able to help.

Our server was RHEL 7, Splunk Enterprise 6.6.3, FIPS mode enabled, and KVstore disabled. It wouldn't start post-upgrade and threw the same errors.

The fix for us was to enable the KVstore and use the deprecated SSL attributes referenced here, along with the deprecated attribute "caCertPath". It would not work if we used the modern SSL attributes introduced in 6.5. The service started fine after that. We then disabled the KVstore, removed those four deprecated attributes, and were able to successfully restart the service.

Before doing all of that, we also tried upgrading from 6.6.3 to 7.1.4 and had no issues. This issue only happened when upgrading to either 7.2.0 or 7.2.1; both versions threw the same error, even though the KVstore was configured to be disabled. It seems 7.2.0 and 7.2.1 do not respect the "disabled=true" attribute during the upgrade itself.

EDIT: --Additional Solution Found--
I wasn't happy with that being the solution and did some testing. I discovered that 7.2.1 does check for the modern SSL attributes, but it seems you can use relative paths instead of the actual value, excluding "sslPassword". It's never worked with actual values for "sslRootCAPath" and "serverCert", but the relative paths worked fine. So those three attributes paired with temporarily enabling the kvstore, also did the trick.



Path Finder

FYI, I was able to complete the entire environment upgrade (Search Cluster and Deployer, Indexing Cluster and Master, Event Collectors, Heavy Forwarders, and Deployment Server) to Splunk v7.2.1 by simply removing the double quotes around the “tls1.2” value for the sslVersions setting in a custom app we deploy to our instances. I did not end up needing to comment it out during the upgrade process as originally outlined in the steps provided above.

Engager

I was able to get Splunk to start by removing my site-specific TLS settings. After starting Splunk the first time, I restored my site-specific settings to how they were before the upgrade and restarted Splunk. It seems to be fine now.

These are the config changes I made:

In /opt/splunk/etc/system/local/server.conf:

Changed these lines:

[sslConfig]
cipherSuite = EECDH+AESGCM
ecdhCurves = secp256r1, secp384r1, secp521r1 

to

[sslConfig]
#cipherSuite = EECDH+AESGCM
#ecdhCurves = secp256r1, secp384r1, secp521r1 

Also, in /opt/splunk/etc/system/local/web.conf:

[settings]
cipherSuite = EECDH+AESGCM
ecdhCurves = secp256r1, secp384r1, secp521r1

changed to:

[settings]
#cipherSuite = EECDH+AESGCM
#ecdhCurves = secp256r1, secp384r1, secp521r1

Again, after starting Splunk and completing the config migration, I restored the files and restarted Splunk.

Path Finder

Hi Folks,

I've got another question on this. We are seeing this issue only on our indexers; the search heads and heavy forwarders have no problem.

I hashed the TLS settings out on one of our 3 indexers, but I'm still getting the migration error.
sh: line 1: 27177 Segmentation fault (core dumped) splunk migrate migrate-kvstore 2>&1
ERROR while running migrate-kvstore migration.

Is it enough to hash the TLS settings out on the node I want to upgrade?

Thanks
Alex


Engager

We had the same problem. The issue turned out to be that we had had to strengthen the TLS cipherSuite on our Splunk boxes to meet PCI requirements. We had to weaken it slightly on the search heads, because mongod unfortunately uses the same cipherSuite that splunkd uses from server.conf, and it wouldn't start without an additional cipher. We didn't worry about it on the indexers, etc. that didn't need to run mongod anyway. Unfortunately, mongod needs to be able to start during the 7.2 (maybe all 7.x?) upgrade process to do a kvstore migration, and because it can't, we got the error you got above.

Once we weakened the cipherSuite across the board, the upgrade migration was able to start mongod and proceed.

The confusing thing about it, and what we submitted as a feature request as part of our support case, was that mongod.log has nothing when it fails to start for this reason. This made troubleshooting somewhat difficult.

Engager

Can you indicate what specific changes you made to which specific files to resolve the problem?

Engager

In server.conf, we had limited the cipher suite thusly:

[sslConfig]
cipherSuite = ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDH-ECDSA-AES256-GCM-SHA384:ECDH-ECDSA-AES128-GCM-SHA256:ECDH-ECDSA-AES128-SHA256

This list was fine for splunkd, but mongod doesn't like it. We had to temporarily add an additional cipher to the list ("AES256-GCM-SHA384") so that mongod, and thus the 7.2 upgrade migration commands, would work. After the upgrade we removed that cipher from all the servers that don't need to run kvstore/mongod.

This is all to meet a TLS hardening requirement to pass a PCI vulnerability scan.
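Concretely, the temporary server.conf change described above amounts to something like the following sketch (the cipher list is copied from the post above, with the one extra cipher appended at the end):

```ini
[sslConfig]
# PCI-hardened list with AES256-GCM-SHA384 temporarily appended so mongod can
# start during the 7.2 kvstore migration; remove it again afterwards on
# servers that do not need to run the kvstore.
cipherSuite = ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDH-ECDSA-AES256-GCM-SHA384:ECDH-ECDSA-AES128-GCM-SHA256:ECDH-ECDSA-AES128-SHA256:AES256-GCM-SHA384
```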

Path Finder

This is actually mentioned in the default server.conf since around v6.6.x: MongoDB, aka the kvstore (not Splunk itself), does not support forward-secrecy ciphers, i.e. Mongo needs RSA ones to work.

# The following non-forward-secrecy ciphers were added to support the kv store:
#     AES256-GCM-SHA384:AES128-GCM-SHA256:AES128-SHA256.

Path Finder

Hi,
we are having the same issues. A support case is open, but there has been no result so far.

Alex


Path Finder

We already updated our Linux kernel to version
Linux 4.4.156-94.57-default #1 SMP Tue Oct 2 06:43:37 UTC 2018 (82521a6) x86_64 x86_64 x86_64 GNU/Linux
but the segfault is still going on with that version.


Motivator

Still nothing? Just ran into this on a 7.1.0 to 7.2.0 upgrade


Path Finder

Nope, Splunk Support is on Deep Dive....


Path Finder

Not an answer, but I have the same issue upgrading from 7.1.1 to 7.2.

Finished generating checksums for datamodel and report acceleration bucket summaries for all indexes.
ERROR: Failed to migrate mongo feature compatibility version:
ERROR while running migrate-kvstore migration.

and again there is nothing in the logs. A reboot did not fix it.


Path Finder

I opened up an Enterprise Support ticket for this issue and will update this post once we have figured out the issue and the fix.


Path Finder

@capilarity, would you mind posting what Splunk Base apps you have installed on your Splunk instance that also failed the upgrade from 7.1.1 to 7.2.0, along with your setup (Search Cluster, Indexer Cluster, etc.)?

Splunk Support and I are trying to track down the issue, and I recently found that our base installs (nothing special in the setup) upgrade from v6.6.5 to v7.1.3 to v7.2.0, and from v6.6.5 straight to v7.2.0, without issues. I am thinking there is either an issue with the specific settings on each instance that make them a Search Cluster, Indexer Cluster, Deployer, Master Node, etc., and/or an issue with one of the Splunk Base apps installed in my Splunk environment that does not cope with the upgrade.

I would like to provide them a comparison of two setups having the same issues. Thanks!


Path Finder

We have no Splunk Base apps installed on this instance, and only one home-grown app that monitors changes to the config.
This is our master node; in addition we have two indexers in a site1/site2 configuration, two non-clustered search heads, and a deployment server, all separate.


Path Finder

Thanks, I will relay the information to Splunk Support.


Path Finder

Thanks all, this has now been resolved with the help of support. We had defined TLS 1.2 for our Splunk-to-Splunk comms, and this was forcing mongodb to use the same.
We hashed out the config, and the upgrade completed fine.
Once the upgrade was complete, we reverted to the original TLS configuration and restarted.
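That hash-out/revert cycle can be sketched as a pair of sed edits. This demo works on a scratch file so it is safe to run; on a live instance the target is whichever server.conf sets sslVersions:

```shell
# Demo of the hash-out/revert cycle on a scratch copy of server.conf.
conf=$(mktemp)
printf '[sslConfig]\nsslVersions = tls1.2\n' > "$conf"
# 1. Hash out the forced TLS version before the upgrade...
sed -i 's/^sslVersions/#sslVersions/' "$conf"
grep 'sslVersions' "$conf"   # prints: #sslVersions = tls1.2
# 2. ...run the package upgrade and the migration, then revert and restart:
sed -i 's/^#sslVersions/sslVersions/' "$conf"
grep 'sslVersions' "$conf"   # prints: sslVersions = tls1.2
rm -f "$conf"
```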
