Splunk Enterprise

Why does my indexer keep crashing - IndexerTPoolWorker killed by abort (signal 6)?

sylim_splunk
Splunk Employee

The indexer rebooted non-gracefully. Since the reboot, Splunk starts generating crash files shortly after every restart. I spent the last two days running fsck repair on all buckets; it doesn't seem to have helped.
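
For reference, a minimal sketch of that bulk repair, assuming a default /opt/splunk install; confirm the exact fsck flags against your version's "splunk fsck --help" before relying on it.

# Minimal sketch (assumptions: default install path, standard fsck flags).
import subprocess

SPLUNK_BIN = "/opt/splunk/bin/splunk"  # assumed install path

# Scan and repair every bucket of every index; this cannot recover damage
# that goes beyond what the journal data still allows.
subprocess.run(
    [SPLUNK_BIN, "fsck", "repair", "--all-buckets-all-indexes"],
    check=True,
)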

No relevant errors in the splunkd.log.

Crash log files:

crash-2022-09-07-14:45:07.log
crash-2022-09-07-14:45:15.log
crash-2022-09-07-14:45:24.log
crash-2022-09-07-14:45:32.log
crash-2022-09-07-14:45:40.log

Crash log: every crash log shows the same pattern as below; only the crashing thread changes (IndexerTPoolWorker-2, IndexerTPoolWorker-4, IndexerTPoolWorker-7, and the like).

[build 87344edfcdb4] 2022-09-07 13:31:01
Received fatal signal 6 (Aborted) on PID 193171.
Cause:
Signal sent by PID 193171 running under UID 53292.
Crashing thread: IndexerTPoolWorker-2

Backtrace (PIC build):
[0x00007EFDA0A27387] gsignal + 55 (libc.so.6 + 0x36387)
[0x00007EFDA0A28A78] abort + 328 (libc.so.6 + 0x37A78)
[0x00007EFDA0A201A6] ? (libc.so.6 + 0x2F1A6)
[0x00007EFDA0A20252] ? (libc.so.6 + 0x2F252)
[0x000056097778BA2C] ReadableJournalSliceDirectory::findEventTimeRange(int*, int*, bool)

...

Libc abort message: splunkd: /opt/splunk/src/pipeline/indexer/JournalSlice.cpp:1780: bool ReadableJournalSliceDirectory::findEventTimeRange(st_time_t*, st_time_t*, bool): Assertion `tell() == pos' failed.

1 Solution

sylim_splunk
Splunk Employee

Here are the steps to find the culprit.

There are 17 crash logs in the diag file. The crashing point in every crash log file is the same, as shown below:
---- excerpts ---
Libc abort message: splunkd: /opt/splunk/src/pipeline/indexer/JournalSlice.cpp:1780: bool ReadableJournalSliceDirectory::findEventTimeRange(st_time_t*, st_time_t*, bool): Assertion `tell() == pos' failed.

------------------
This suggests corrupt or truncated bucket(s), but the crash log files give no detail about which bucket(s) are corrupt.

To find the relevant entries in splunkd.log, match the time (e.g. 14:45:24) and the thread name (e.g. IndexerTPoolWorker-4) from each crash log file. Doing so, I found the logs below; a scripted version of this matching is sketched after the list.


- In crash-2022-09-07-14:45:24.log:   
Crashing thread: IndexerTPoolWorker-4

Check splunkd.log for the entries logged at 14:45:24 by IndexerTPoolWorker-4:

splunkd.log: 09-07-2022 14:45:24.025 -0700 INFO HotBucketRoller [322700 IndexerTPoolWorker-4] - found hot bucket='/storage/tier1/20000_idx2/db/hot_v1_432'

- In crash-2022-09-07-14:45:32.log: Crashing thread: IndexerTPoolWorker-7

Check splunkd.log for the entries logged at 14:45:32 by IndexerTPoolWorker-7:


splunkd.log: 09-07-2022 14:45:32.238 -0700 INFO HotBucketRoller [322983 IndexerTPoolWorker-7] - found hot bucket='/storage/tier1/20000_idx2/db/hot_v1_432'

- In crash-2022-09-07-14:45:40.log: Crashing thread: IndexerTPoolWorker-6


splunkd.log: 09-07-2022 14:45:40.566 -0700 INFO HotBucketRoller [323220 IndexerTPoolWorker-6] - found hot bucket='/storage/tier1/20000_idx2/db/hot_v1_432'


In each case the crash log time and the thread number (IndexerTPoolWorker-#) match, and the log messages point to the same bucket: '/storage/tier1/20000_idx2/db/hot_v1_432'.
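
To automate this matching across many crash logs, here is a minimal sketch. It relies only on the details visible above (the "Crashing thread:" line, the HH:MM:SS in the crash file name, and the bucket='...' field in splunkd.log); the log directory and everything else are illustrative assumptions.

# Minimal sketch: correlate each crash log's time and crashing thread with
# splunkd.log entries and count which bucket path keeps showing up.
import glob
import os
import re
from collections import Counter

LOG_DIR = "/opt/splunk/var/log/splunk"  # assumed default log location

def crashing_thread(crash_file):
    # Return the value of the "Crashing thread:" line from one crash log.
    with open(crash_file, errors="replace") as f:
        for line in f:
            if line.startswith("Crashing thread:"):
                return line.split(":", 1)[1].strip()
    return None

def crash_second(crash_file):
    # Extract HH:MM:SS from a name like crash-2022-09-07-14:45:24.log.
    m = re.search(r"(\d{2}:\d{2}:\d{2})\.log$", os.path.basename(crash_file))
    return m.group(1) if m else None

with open(os.path.join(LOG_DIR, "splunkd.log"), errors="replace") as f:
    splunkd_lines = f.readlines()

buckets = Counter()
for crash_file in sorted(glob.glob(os.path.join(LOG_DIR, "crash-*.log"))):
    thread = crashing_thread(crash_file)
    second = crash_second(crash_file)
    if not thread or not second:
        continue
    # Keep splunkd.log lines written in the crash second by the crashing thread;
    # the trailing "]" avoids matching e.g. IndexerTPoolWorker-40 for -4.
    for line in splunkd_lines:
        if second in line and thread + "]" in line:
            print(os.path.basename(crash_file), "->", line.rstrip())
            m = re.search(r"bucket='([^']+)'", line)
            if m:
                buckets[m.group(1)] += 1

# A bucket that appears for (nearly) every crash is the prime suspect.
for path, count in buckets.most_common():
    print(count, path)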

As the bucket was suspected to be damaged beyond what fsck could repair, we moved it out of $SPLUNK_DB, and the indexer then came up and ran like the other peers.
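
A minimal sketch of that quarantine step, assuming splunkd is stopped on the peer first; the quarantine directory is illustrative, and the bucket is kept rather than deleted in case it needs further inspection.

# Minimal sketch: move the suspect bucket out of the index's db directory so
# splunkd no longer tries to open it. Stop splunkd on this peer beforehand.
import shutil
from pathlib import Path

SUSPECT = Path("/storage/tier1/20000_idx2/db/hot_v1_432")  # bucket found above
QUARANTINE = Path("/storage/quarantine")  # illustrative holding area

QUARANTINE.mkdir(parents=True, exist_ok=True)
if SUSPECT.exists():
    shutil.move(str(SUSPECT), str(QUARANTINE / SUSPECT.name))
    print("moved", SUSPECT, "->", QUARANTINE / SUSPECT.name)

After restarting splunkd, check that no new crash-*.log files appear in the log directory.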

 

This covers the specific situation of a corrupt bucket causing the indexer to crash, and how to find the corrupt bucket(s).


