Knowledge Management
Highlighted

kv store failing again and again After Splunk Upgrade, out of memory errors (6.6.3 --> 7.1.6)

Here is the mongod.log

2019-02-15T08:12:44.837Z W CONTROL  [main] net.ssl.sslCipherConfig is deprecated. It will be removed in a future release.
 2019-02-15T08:12:45.266Z I CONTROL  [initandlisten] MongoDB starting : pid=8956 port=8191 dbpath=D:\Splunk\var\lib\splunk\kvstore\mongo 64-bit host=XXXXXX
 2019-02-15T08:12:45.266Z I CONTROL  [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
 2019-02-15T08:12:45.266Z I CONTROL  [initandlisten] db version v3.4.10-splunk
 2019-02-15T08:12:45.266Z I CONTROL  [initandlisten] git version: 078f28920cb24de0dd479b5ea6c66c644f6326e9
 2019-02-15T08:12:45.266Z I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.2p-fips  14 Aug 2018
 2019-02-15T08:12:45.266Z I CONTROL  [initandlisten] allocator: tcmalloc
 2019-02-15T08:12:45.266Z I CONTROL  [initandlisten] modules: none
 2019-02-15T08:12:45.266Z I CONTROL  [initandlisten] build environment:
 2019-02-15T08:12:45.266Z I CONTROL  [initandlisten]     distmod: 2008plus-ssl
 2019-02-15T08:12:45.266Z I CONTROL  [initandlisten]     distarch: x86_64
 2019-02-15T08:12:45.266Z I CONTROL  [initandlisten]     target_arch: x86_64
 2019-02-15T08:12:45.266Z I CONTROL  [initandlisten] options: { net: { port: 8191, ssl: { PEMKeyFile: "D:\Splunk\etc\auth\server.pem", PEMKeyPassword: "<password>", allowInvalidHostnames: true, mode: "requireSSL", sslCipherConfig: "TLSv1+HIGH:TLSv1.2+HIGH:@STRENGTH" } }, replication: { oplogSizeMB: 4000, replSet: "27E07F05-DA98-4185-B26F-95378779627A" }, security: { javascriptEnabled: false, keyFile: "D:\Splunk\var\lib\splunk\kvstore\mongo\splunk.key" }, setParameter: { enableLocalhostAuthBypass: "0", oplogFetcherMaxFetcherRestarts: "0" }, storage: { dbPath: "D:\Splunk\var\lib\splunk\kvstore\mongo", engine: "mmapv1", mmapv1: { smallFiles: true } }, systemLog: { timeStampFormat: "iso8601-utc" } }
 2019-02-15T08:12:45.267Z W -        [initandlisten] Detected unclean shutdown - D:\Splunk\var\lib\splunk\kvstore\mongo\mongod.lock is not empty.
 2019-02-15T08:12:45.288Z I JOURNAL  [initandlisten] journal dir=D:\Splunk\var\lib\splunk\kvstore\mongo\journal
 2019-02-15T08:12:45.288Z I JOURNAL  [initandlisten] recover begin
 2019-02-15T08:12:45.288Z I JOURNAL  [initandlisten] info no lsn file in journal/ directory
 2019-02-15T08:12:45.288Z I JOURNAL  [initandlisten] recover lsn: 0
 2019-02-15T08:12:45.288Z I JOURNAL  [initandlisten] recover D:\Splunk\var\lib\splunk\kvstore\mongo\journal\j._0
 2019-02-15T08:12:45.289Z I JOURNAL  [initandlisten] recover applying initial journal section with sequence number 683
 2019-02-15T08:12:45.339Z I JOURNAL  [initandlisten] recover cleaning up
 2019-02-15T08:12:45.339Z I JOURNAL  [initandlisten] removeJournalFiles
 2019-02-15T08:12:45.339Z I JOURNAL  [initandlisten] old journal file will be removed: D:\Splunk\var\lib\splunk\kvstore\mongo\journal\j._0
 2019-02-15T08:12:45.340Z I JOURNAL  [initandlisten] recover done
 2019-02-15T08:12:45.446Z I JOURNAL  [durability] Durability thread started
 2019-02-15T08:12:45.448Z I JOURNAL  [journal writer] Journal writer thread started
 2019-02-15T08:12:45.450Z I CONTROL  [initandlisten] 
 2019-02-15T08:12:45.450Z I CONTROL  [initandlisten] ** WARNING: No SSL certificate validation can be performed since no CA file has been provided
 2019-02-15T08:12:45.450Z I CONTROL  [initandlisten] **          Please specify an sslCAFile parameter.
 2019-02-15T08:12:46.379Z I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory 'D:/Splunk/var/lib/splunk/kvstore/mongo/diagnostic.data'
 2019-02-15T08:12:46.396Z I REPL     [initandlisten] Did not find local voted for document at startup.
 2019-02-15T08:12:46.400Z I NETWORK  [thread1] waiting for connections on port 8191 ssl
 2019-02-15T08:12:46.401Z I REPL     [replExecDBWorker-2] New replica set config in use: { _id: "27E07F05-DA98-4185-B26F-95378779627A", version: 1, members: [ { _id: 0, host: "127.0.0.1:8191", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: { instance: "27E07F05-DA98-4185-B26F-95378779627A", all: "all" }, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 60000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 } } }
 2019-02-15T08:12:46.401Z I REPL     [replExecDBWorker-2] This node is 127.0.0.1:8191 in the config
 2019-02-15T08:12:46.402Z I REPL     [replExecDBWorker-2] transition to STARTUP2
 2019-02-15T08:12:46.402Z I REPL     [replExecDBWorker-2] Starting replication storage threads
 2019-02-15T08:12:46.421Z I REPL     [replExecDBWorker-2] Starting replication fetcher thread
 2019-02-15T08:12:46.421Z I REPL     [replExecDBWorker-2] Starting replication applier thread
 2019-02-15T08:12:46.421Z I REPL     [replExecDBWorker-2] Starting replication reporter thread
 2019-02-15T08:12:46.422Z I REPL     [rsSync] transition to RECOVERING
 2019-02-15T08:12:46.425Z I REPL     [rsSync] transition to SECONDARY
 2019-02-15T08:12:46.425Z I REPL     [rsSync] transition to PRIMARY
 2019-02-15T08:12:47.009Z I FTDC     [ftdc] Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost. OK
 2019-02-15T08:12:47.433Z I REPL     [rsSync] transition to primary complete; database writes are now permitted
 2019-02-15T08:12:55.777Z I NETWORK  [thread1] connection accepted from 127.0.0.1:49369 #1 (1 connection now open)
 2019-02-15T08:12:55.790Z I NETWORK  [conn1] received client metadata from 127.0.0.1:49369 conn1: { driver: { name: "mongoc", version: "1.6.3" }, os: { type: "Windows", name: "Windows", version: "6.3 (9600)", architecture: "x86_64" }, platform: "cfg=0x200c9 CC=MSVC 1900 CFLAGS="-DPY_BIG_ENDIAN=0  /DWIN32 /D_WINDOWS /W3" LDFLAGS="/machine:x64"" }
 2019-02-15T08:12:55.811Z I ACCESS   [conn1] Successfully authenticated as principal __system on local
 2019-02-15T08:12:55.811Z I -        [conn1] end connection 127.0.0.1:49369 (1 connection now open)
 2019-02-15T08:12:56.813Z I NETWORK  [thread1] connection accepted from 127.0.0.1:49372 #2 (1 connection now open)
 2019-02-15T08:12:56.824Z I NETWORK  [conn2] received client metadata from 127.0.0.1:49372 conn2: { driver: { name: "mongoc", version: "1.6.3" }, os: { type: "Windows", name: "Windows", version: "6.3 (9600)", architecture: "x86_64" }, platform: "cfg=0x200c9 CC=MSVC 1900 CFLAGS="-DPY_BIG_ENDIAN=0  /DWIN32 /D_WINDOWS /W3" LDFLAGS="/machine:x64"" }
 2019-02-15T08:12:56.877Z I ACCESS   [conn2] Successfully authenticated as principal __system on local
 2019-02-15T08:12:56.879Z I NETWORK  [thread1] connection accepted from 127.0.0.1:49374 #3 (2 connections now open)
 2019-02-15T08:12:56.881Z I -        [conn2] end connection 127.0.0.1:49372 (2 connections now open)
 2019-02-15T08:12:56.886Z I NETWORK  [conn3] received client metadata from 127.0.0.1:49374 conn3: { driver: { name: "mongoc", version: "1.6.3" }, os: { type: "Windows", name: "Windows", version: "6.3 (9600)", architecture: "x86_64" }, platform: "cfg=0x200c9 CC=MSVC 1900 CFLAGS="-DPY_BIG_ENDIAN=0  /DWIN32 /D_WINDOWS /W3" LDFLAGS="/machine:x64"" }
 2019-02-15T08:12:56.907Z I ACCESS   [conn3] Successfully authenticated as principal __system on local
 2019-02-15T08:12:56.908Z I -        [conn3] end connection 127.0.0.1:49374 (1 connection now open)
 2019-02-15T08:12:56.916Z I NETWORK  [thread1] connection accepted from 127.0.0.1:49375 #4 (1 connection now open)
 2019-02-15T08:12:56.952Z I NETWORK  [conn4] received client metadata from 127.0.0.1:49375 conn4: { driver: { name: "mongoc", version: "1.6.3" }, os: { type: "Windows", name: "Windows", version: "6.3 (9600)", architecture: "x86_64" }, platform: "cfg=0x200c9 CC=MSVC 1900 CFLAGS="-DPY_BIG_ENDIAN=0  /DWIN32 /D_WINDOWS /W3" LDFLAGS="/machine:x64"" }
 2019-02-15T08:12:56.974Z I ACCESS   [conn4] Successfully authenticated as principal __system on local
 2019-02-15T08:12:56.975Z I -        [conn4] end connection 127.0.0.1:49375 (1 connection now open)
 2019-02-15T08:12:56.978Z I NETWORK  [thread1] connection accepted from 127.0.0.1:49376 #5 (1 connection now open)
 2019-02-15T08:12:56.981Z I NETWORK  [conn5] received client metadata from 127.0.0.1:49376 conn5: { driver: { name: "mongoc", version: "1.6.3" }, os: { type: "Windows", name: "Windows", version: "6.3 (9600)", architecture: "x86_64" }, platform: "cfg=0x200c9 CC=MSVC 1900 CFLAGS="-DPY_BIG_ENDIAN=0  /DWIN32 /D_WINDOWS /W3" LDFLAGS="/machine:x64"" }
 2019-02-15T08:12:56.982Z I NETWORK  [thread1] connection accepted from 127.0.0.1:49377 #6 (2 connections now open)
 2019-02-15T08:12:57.016Z I ACCESS   [conn6] Successfully authenticated as principal __system on local
 2019-02-15T08:12:57.076Z I NETWORK  [thread1] connection accepted from 127.0.0.1:49378 #7 (3 connections now open)
 2019-02-15T08:12:57.086Z I NETWORK  [conn7] received client metadata from 127.0.0.1:49378 conn7: { driver: { name: "mongoc", version: "1.6.3" }, os: { type: "Windows", name: "Windows", version: "6.3 (9600)", architecture: "x86_64" }, platform: "cfg=0x200c9 CC=MSVC 1900 CFLAGS="-DPY_BIG_ENDIAN=0  /DWIN32 /D_WINDOWS /W3" LDFLAGS="/machine:x64"" }
 2019-02-15T08:12:57.088Z I NETWORK  [thread1] connection accepted from 127.0.0.1:49379 #8 (4 connections now open)
 2019-02-15T08:12:57.123Z I ACCESS   [conn8] Successfully authenticated as principal __system on local
 2019-02-15T08:12:58.199Z I NETWORK  [thread1] connection accepted from 127.0.0.1:49383 #9 (5 connections now open)
 2019-02-15T08:12:58.202Z I NETWORK  [conn9] received client metadata from 127.0.0.1:49383 conn9: { driver: { name: "mongoc", version: "1.6.3" }, os: { type: "Windows", name: "Windows", version: "6.3 (9600)", architecture: "x86_64" }, platform: "cfg=0x200c9 CC=MSVC 1900 CFLAGS="-DPY_BIG_ENDIAN=0  /DWIN32 /D_WINDOWS /W3" LDFLAGS="/machine:x64"" }
 2019-02-15T08:12:58.225Z I ACCESS   [conn9] Successfully authenticated as principal __system on local
 2019-02-15T08:12:58.334Z I NETWORK  [thread1] connection accepted from 127.0.0.1:49385 #10 (6 connections now open)
 2019-02-15T08:12:58.359Z I ACCESS   [conn10] Successfully authenticated as principal __system on local
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      index_collator_extension+0x1dc493
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      index_collator_extension+0x1d926e
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      index_collator_extension+0x165c88
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      ???
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      index_collator_extension+0x75d91
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      ???
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      ???
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      ???
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      ???
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      ???
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      index_collator_extension+0x7c4a82
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      index_collator_extension+0x7c4913
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      index_collator_extension+0x1619a7
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      index_collator_extension+0x1615af
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      ???
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] mongod.exe      index_collator_extension+0x82924d
 2019-02-15T08:13:51.302Z I CONTROL  [conn9] KERNEL32.DLL    BaseThreadInitThunk+0x22
 2019-02-15T08:13:51.302Z F -        [conn9] out of memory.

Increased the oplogsize setting as well but that also not worked out.

0 Karma
Highlighted

Re: kv store failing again and again After Splunk Upgrade, out of memory errors (6.6.3 --> 7.1.6)

Splunk Employee
Splunk Employee

Hi,

something may be eating the memory (could be search activity, you may have been living close to the limit and now hit the wall)

look at https://docs.splunk.com/Documentation/Splunk/latest/Troubleshooting/Troubleshootmemoryusage

make sure you also have enough swap so at least it won't crash for temp peak (obviously you don't want to swap all time)

0 Karma