Splunk Enterprise

What is the problem with splunkd recover-metadata --handle-roll processes running until they crash Splunk?

ChrisFontana

Hello everyone,

I recently migrated from old hardware to newer hardware and started indexing the same data that I was indexing before (njmon - the JSON version of nmon).

In the old infra, with the same Splunk version running on both environments, there were no big issues, even though it has lower capacity than the new one. Also, the newer one uses faster disks (NVMe).

After 1-3 days of ingesting njmon data, the indexer crashes, and during this time I can see a lot of splunkd recover-metadata processes and also splunkd fsck --log-to--splunkd-log repair processes:

[root@splunk]# ps -ef | grep splunkd
splunk 21828 16396 99 12:00 ? 02:59:44 splunkd fsck --log-to--splunkd-log repair --try-warm-then-cold --one-bucket --index-name=njmon --bucket-name=db_1683179867_1683170671_48 --bloomfilter-only
splunk 21829 21828 0 12:00 ? 00:00:00 splunkd fsck --log-to--splunkd-log repair --try-warm-then-cold --one-bucket --index-name=njmon --bucket-name=db_1683179867_1683170671_48 --bloomfilter-only
splunk 41284 16396 99 12:20 ? 02:40:30 splunkd recover-metadata /net/splunk/fs0/splunk-hotwarm/njmon/db/db_1683195586_1683179630_51 --handle-roll njmon /net/splunk/fs0/splunk-hotwarm/njmon/db/db_1683195586_1683179630_51 --write-level 4 --tsidx-target-size 1572864000 --msidx-comp-block-size 1024
splunk 80067 16396 99 12:59 ? 02:00:53 splunkd recover-metadata /net/splunk/fs0/splunk-hotwarm/njmon/db/db_1683197989_1683180366_54 --handle-roll njmon /net/splunk/fs0/splunk-hotwarm/njmon/db/db_1683197989_1683180366_54 --write-level 4 --tsidx-target-size 1572864000 --msidx-comp-block-size 1024
splunk 136806 16396 99 13:44 ? 01:16:45 splunkd recover-metadata /net/splunk/fs0/splunk-hotwarm/njmon/db/db_1683200654_1683180434_53 --handle-roll njmon /net/splunk/fs0/splunk-hotwarm/njmon/db/db_1683200654_1683180434_53 --write-level 4 --tsidx-target-size 1572864000 --msidx-comp-block-size 1024
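For anyone hitting the same symptom, this is a minimal shell sketch I use to watch whether these workers are accumulating over time (the bracketed grep patterns keep grep from matching its own process; the counts are just illustrative):

```shell
# Count running recover-metadata and fsck repair workers.
# The [r]/[f] bracket trick prevents grep from matching itself.
recover=$(ps -ef | grep -c '[r]ecover-metadata')
fsck=$(ps -ef | grep -c '[f]sck .*repair')
echo "recover-metadata: ${recover}, fsck repair: ${fsck}"
```

Running this from cron or a watch loop makes it easy to see whether the process count grows steadily before the crash or spikes all at once.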

 

The server is running RHEL 8 with 128 GB RAM, 48 physical cores (96 logical).

Splunk Version: Splunk 8.2.10 (build 417e74d5c950)

 

The difference between the old infra and the new one is the tsidx writing level: in the old infra we're using 2, in the newer one we're using 4, but all the environments consuming the data are on version 8.2 or greater.
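In case the level difference is relevant to anyone comparing setups: the level is controlled per index by tsidxWritingLevel in indexes.conf. A sketch of how the old behavior would be pinned back (the [njmon] stanza name matches my index; not claiming this is the fix - check your effective value first with `splunk btool indexes list njmon --debug`):

```ini
# indexes.conf -- sketch only, not verified as a fix
[njmon]
# Old infra ran level 2; the new indexer runs 4.
# Lower levels produce tsidx files readable by older
# peers, at the cost of newer write optimizations.
tsidxWritingLevel = 2
```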

 

Any hints from the community?
