Splunk Enterprise

What is the problem with splunkd recover-metadata --handle-roll processes running until they crash Splunk?

ChrisFontana

Hello everyone,

I recently migrated from old hardware to newer hardware and started indexing the same data that I was indexing before (njmon - the JSON version of nmon).

In the old infra, with the same Splunk version running on both environments, there were no big issues, even though it has lower capacity than the new one. Also, the newer one uses faster disks (NVMe).

After 1-3 days of ingesting njmon data, the indexer crashes, and during this time I can see a lot of splunkd recover-metadata processes and also splunkd fsck --log-to--splunkd-log repair processes:

[root@splunk]# ps -ef | grep splunkd
splunk 21828 16396 99 12:00 ? 02:59:44 splunkd fsck --log-to--splunkd-log repair --try-warm-then-cold --one-bucket --index-name=njmon --bucket-name=db_1683179867_1683170671_48 --bloomfilter-only
splunk 21829 21828 0 12:00 ? 00:00:00 splunkd fsck --log-to--splunkd-log repair --try-warm-then-cold --one-bucket --index-name=njmon --bucket-name=db_1683179867_1683170671_48 --bloomfilter-only
splunk 41284 16396 99 12:20 ? 02:40:30 splunkd recover-metadata /net/splunk/fs0/splunk-hotwarm/njmon/db/db_1683195586_1683179630_51 --handle-roll njmon /net/splunk/fs0/splunk-hotwarm/njmon/db/db_1683195586_1683179630_51 --write-level 4 --tsidx-target-size 1572864000 --msidx-comp-block-size 1024
splunk 80067 16396 99 12:59 ? 02:00:53 splunkd recover-metadata /net/splunk/fs0/splunk-hotwarm/njmon/db/db_1683197989_1683180366_54 --handle-roll njmon /net/splunk/fs0/splunk-hotwarm/njmon/db/db_1683197989_1683180366_54 --write-level 4 --tsidx-target-size 1572864000 --msidx-comp-block-size 1024
splunk 136806 16396 99 13:44 ? 01:16:45 splunkd recover-metadata /net/splunk/fs0/splunk-hotwarm/njmon/db/db_1683200654_1683180434_53 --handle-roll njmon /net/splunk/fs0/splunk-hotwarm/njmon/db/db_1683200654_1683180434_53 --write-level 4 --tsidx-target-size 1572864000 --msidx-comp-block-size 1024
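For anyone hitting the same symptom, this is a minimal shell sketch I use to watch whether these workers are accumulating over time (the bracketed grep patterns keep grep from matching its own process; the counts are just illustrative):

```shell
# Count running recover-metadata and fsck repair workers.
# The [r]/[f] bracket trick prevents grep from matching itself.
recover=$(ps -ef | grep -c '[r]ecover-metadata')
fsck=$(ps -ef | grep -c '[f]sck .*repair')
echo "recover-metadata: ${recover}, fsck repair: ${fsck}"
```

Running this from cron or a watch loop makes it easy to see whether the process count grows steadily before the crash or spikes all at once.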

 

The server is running RHEL 8 with 128 GB RAM, 48 physical cores (96 logical).

Splunk Version: Splunk 8.2.10 (build 417e74d5c950)

 

The difference between the old infra and the new one is the tsidx writing level: in the old infra we're using 2, in the newer one we're using 4, but all the environments consuming the data are on version 8.2 or greater.
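In case the level difference is relevant to anyone comparing setups: the level is controlled per index by tsidxWritingLevel in indexes.conf. A sketch of how the old behavior would be pinned back (the [njmon] stanza name matches my index; not claiming this is the fix - check your effective value first with `splunk btool indexes list njmon --debug`):

```ini
# indexes.conf -- sketch only, not verified as a fix
[njmon]
# Old infra ran level 2; the new indexer runs 4.
# Lower levels produce tsidx files readable by older
# peers, at the cost of newer write optimizations.
tsidxWritingLevel = 2
```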

 

Any hints from the community?
