Splunk Enterprise

Why does my indexer keep crashing with an abort in IndexerTPoolWorker?

sylim_splunk
Splunk Employee

The indexer rebooted non-gracefully. After the reboot, Splunk starts generating crash files shortly after each restart. I spent the last two days running fsck repair on all buckets, but it doesn't seem to have helped.

No relevant errors in the splunkd.log.

Crash log files:

crash-2022-09-07-14:45:07.log
crash-2022-09-07-14:45:15.log
crash-2022-09-07-14:45:24.log
crash-2022-09-07-14:45:32.log
crash-2022-09-07-14:45:40.log

Crash log: every crash log shows the same pattern as below; only the crashing thread changes, e.g. IndexerTPoolWorker-2, IndexerTPoolWorker-4, IndexerTPoolWorker-7, and so on.

[build 87344edfcdb4] 2022-09-07 13:31:01
Received fatal signal 6 (Aborted) on PID 193171.
Cause:
Signal sent by PID 193171 running under UID 53292.
Crashing thread: IndexerTPoolWorker-2

Backtrace (PIC build):
[0x00007EFDA0A27387] gsignal + 55 (libc.so.6 + 0x36387)
[0x00007EFDA0A28A78] abort + 328 (libc.so.6 + 0x37A78)
[0x00007EFDA0A201A6] ? (libc.so.6 + 0x2F1A6)
[0x00007EFDA0A20252] ? (libc.so.6 + 0x2F252)
[0x000056097778BA2C] ReadableJournalSliceDirectory::findEventTimeRange(int*, int*, bool)

...

Libc abort message: splunkd: /opt/splunk/src/pipeline/indexer/JournalSlice.cpp:1780: bool ReadableJournalSliceDirectory::findEventTimeRange(st_time_t*, st_time_t*, bool): Assertion `tell() == pos' failed.
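Since every crash log carries a `Crashing thread:` line, the thread for each crash time can be pulled out in one pass. The sketch below recreates two sample crash logs under /tmp (filenames and line format taken from the excerpts above; the /tmp paths are only for demonstration) and extracts the thread from each:

```shell
# Demo: recreate two sample crash logs as they appear above,
# then list the crashing thread recorded in each file.
mkdir -p /tmp/crashdemo && cd /tmp/crashdemo
printf 'Crashing thread: IndexerTPoolWorker-2\n' > crash-2022-09-07-14:45:07.log
printf 'Crashing thread: IndexerTPoolWorker-4\n' > crash-2022-09-07-14:45:24.log
# -H prints the filename, so each crash time pairs with its thread name
grep -H 'Crashing thread' crash-*.log
```

On a real indexer you would run the `grep` against `$SPLUNK_HOME/var/log/splunk/crash-*.log` instead of the demo files.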

1 Solution

sylim_splunk
Splunk Employee

Here are the steps to find the culprit.

There are 17 crash logs in the diag file. The crashing point in every crash log file is the same, as shown below:
---- excerpts ---
Libc abort message: splunkd: /opt/splunk/src/pipeline/indexer/JournalSlice.cpp:1780: bool ReadableJournalSliceDirectory::findEventTimeRange(st_time_t*, st_time_t*, bool): Assertion `tell() == pos' failed.

------------------
This suggests there must be a corrupt or truncated bucket, but the crash log files give no detail about which bucket is corrupt.

To find the relevant entries in splunkd.log, match the time (e.g. 14:45:24) and the thread name (e.g. IndexerTPoolWorker-4) from each crash log file. I found the following logs:


- For crash-2022-09-07-14:45:24.log:   
Crashing thread: IndexerTPoolWorker-4

Check splunkd.log for the entries logged at 14:45:24 by IndexerTPoolWorker-4:

splunkd.log: 09-07-2022 14:45:24.025 -0700 INFO HotBucketRoller [322700 IndexerTPoolWorker-4] - found hot bucket='/storage/tier1/20000_idx2/db/hot_v1_432'

- For crash-2022-09-07-14:45:32.log: Crashing thread: IndexerTPoolWorker-7

Check splunkd.log for the entries logged at 14:45:32 by IndexerTPoolWorker-7:


splunkd.log: 09-07-2022 14:45:32.238 -0700 INFO HotBucketRoller [322983 IndexerTPoolWorker-7] - found hot bucket='/storage/tier1/20000_idx2/db/hot_v1_432'

- For crash-2022-09-07-14:45:40.log: Crashing thread: IndexerTPoolWorker-6


splunkd.log: 09-07-2022 14:45:40.566 -0700 INFO HotBucketRoller [323220 IndexerTPoolWorker-6] - found hot bucket='/storage/tier1/20000_idx2/db/hot_v1_432'


The crash log time and thread number (IndexerTPoolWorker-#) match, and the log messages all point to the same bucket: '/storage/tier1/20000_idx2/db/hot_v1_432'.
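The time-plus-thread matching above can be done with a grep pipeline. The sketch below writes one sample splunkd.log line (copied from the excerpts) to a /tmp file so it is self-contained; the /tmp path is only for demonstration:

```shell
# Demo: a sample splunkd.log line as shown above. On a real indexer the
# file would be $SPLUNK_HOME/var/log/splunk/splunkd.log.
log=/tmp/splunkd-demo.log
printf "09-07-2022 14:45:24.025 -0700 INFO HotBucketRoller [322700 IndexerTPoolWorker-4] - found hot bucket='/storage/tier1/20000_idx2/db/hot_v1_432'\n" > "$log"
# Match first on the crash time, then on the crashing thread name;
# the surviving lines reveal the bucket the thread was working on.
grep '14:45:24' "$log" | grep 'IndexerTPoolWorker-4'
```

Repeating this for each crash log (time and thread pair) shows whether every crash converges on the same bucket path, as it did here.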

As the bucket was suspected to be corrupted beyond what fsck repair could fix, we moved it out of $SPLUNK_DB, and the indexer then came up and ran like the other peers.
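The move itself is just a stop, a `mv` to a directory outside $SPLUNK_DB, and a start. The sketch below simulates the bucket layout under /tmp so it is runnable as-is; the quarantine directory name is an assumption, and the paths are stand-ins for the real ones from the logs above:

```shell
# Sketch: quarantine the corrupt hot bucket. Paths are simulated under
# /tmp; on the real indexer you would run 'splunk stop' first, move the
# bucket out of $SPLUNK_DB, then 'splunk start'.
db=/tmp/bucketdemo/20000_idx2/db          # stand-in for the index db dir
quarantine=/tmp/bucketdemo/quarantine     # assumed holding area outside $SPLUNK_DB
mkdir -p "$db/hot_v1_432" "$quarantine"
# Move the suspect bucket out so splunkd no longer opens it at startup
mv "$db/hot_v1_432" "$quarantine/"
ls "$quarantine"
```

Keeping the bucket in a quarantine directory (rather than deleting it) preserves the data in case a later fsck or Splunk Support can recover it.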

 

This covers the specific case where a corrupt bucket causes the indexer to crash, and how to find the corrupt bucket(s).


