Splunk Enterprise

Why does my indexer keep crashing - IndexerTPoolWorker killed by abort?

sylim_splunk
Splunk Employee

The indexer rebooted non-gracefully. After the reboot, Splunk starts generating crash files shortly after each restart. I spent the last two days running fsck repair on all buckets, but it doesn't seem to have helped.

No relevant errors in the splunkd.log.

Crash log files:

crash-2022-09-07-14:45:07.log
crash-2022-09-07-14:45:15.log
crash-2022-09-07-14:45:24.log
crash-2022-09-07-14:45:32.log
crash-2022-09-07-14:45:40.log

Crash log: every crash log shows the same pattern as below; only the crashing thread changes, e.g. IndexerTPoolWorker-2, IndexerTPoolWorker-4, IndexerTPoolWorker-7, and so on.

[build 87344edfcdb4] 2022-09-07 13:31:01
Received fatal signal 6 (Aborted) on PID 193171.
Cause:
Signal sent by PID 193171 running under UID 53292.
Crashing thread: IndexerTPoolWorker-2

Backtrace (PIC build):
[0x00007EFDA0A27387] gsignal + 55 (libc.so.6 + 0x36387)
[0x00007EFDA0A28A78] abort + 328 (libc.so.6 + 0x37A78)
[0x00007EFDA0A201A6] ? (libc.so.6 + 0x2F1A6)
[0x00007EFDA0A20252] ? (libc.so.6 + 0x2F252)
[0x000056097778BA2C] ReadableJournalSliceDirectory::findEventTimeRange(int*, int*, bool)

...

Libc abort message: splunkd: /opt/splunk/src/pipeline/indexer/JournalSlice.cpp:1780: bool ReadableJournalSliceDirectory::findEventTimeRange(st_time_t*, st_time_t*, bool): Assertion `tell() == pos' failed.

1 Solution

sylim_splunk
Splunk Employee

Here are the steps to find the culprit bucket(s).

There are 17 crash logs in the diag file. The crashing point in every crash log is the same, as in the excerpt below:
---- excerpts ---
Libc abort message: splunkd: /opt/splunk/src/pipeline/indexer/JournalSlice.cpp:1780: bool ReadableJournalSliceDirectory::findEventTimeRange(st_time_t*, st_time_t*, bool): Assertion `tell() == pos' failed.

------------------
This suggests that one or more buckets are corrupt or truncated, but the crash log files give no detail about which bucket(s) are affected.

To find the relevant entries in splunkd.log, match the time (e.g. 14:45:24) and the thread name (e.g. IndexerTPoolWorker-4) from each crash log file. Doing so, I found the entries below (a scripted version of this cross-check is sketched after the list):


- In crash-2022-09-07-14:45:24.log: Crashing thread: IndexerTPoolWorker-4

Check splunkd.log for the entries created at 14:45:24 by IndexerTPoolWorker-4:

splunkd.log: 09-07-2022 14:45:24.025 -0700 INFO HotBucketRoller [322700 IndexerTPoolWorker-4] - found hot bucket='/storage/tier1/20000_idx2/db/hot_v1_432'

- In crash-2022-09-07-14:45:32.log: Crashing thread: IndexerTPoolWorker-7

Check splunkd.log for the entries created at 14:45:32 by IndexerTPoolWorker-7:


splunkd.log: 09-07-2022 14:45:32.238 -0700 INFO HotBucketRoller [322983 IndexerTPoolWorker-7] - found hot bucket='/storage/tier1/20000_idx2/db/hot_v1_432'

- In crash-2022-09-07-14:45:40.log: Crashing thread: IndexerTPoolWorker-6

Check splunkd.log for the entries created at 14:45:40 by IndexerTPoolWorker-6:


splunkd.log: 09-07-2022 14:45:40.566 -0700 INFO HotBucketRoller [323220 IndexerTPoolWorker-6] - found hot bucket='/storage/tier1/20000_idx2/db/hot_v1_432'


The crash log times and the thread numbers (IndexerTPoolWorker-#) match, and the log messages all point to the same bucket: '/storage/tier1/20000_idx2/db/hot_v1_432'.
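
This cross-check can be scripted. The sketch below is an illustration only, not part of the original troubleshooting; it assumes the crash logs and splunkd.log both live under /opt/splunk/var/log/splunk, that crash files follow the crash-YYYY-MM-DD-HH:MM:SS.log naming shown above, and that the matching splunkd.log line carries the same thread name within the same second.

# Minimal sketch (assumptions noted above): correlate Splunk crash logs with
# splunkd.log entries to spot the bucket involved in each crash.
import glob
import os
import re

LOG_DIR = "/opt/splunk/var/log/splunk"  # adjust for your $SPLUNK_HOME

def crashing_thread(crash_file):
    # Return the 'Crashing thread:' value from a crash log, if present.
    with open(crash_file, errors="replace") as f:
        for line in f:
            m = re.match(r"Crashing thread:\s*(\S+)", line)
            if m:
                return m.group(1)
    return None

buckets = {}  # bucket path -> crash files implicating it
for crash_file in sorted(glob.glob(os.path.join(LOG_DIR, "crash-*.log"))):
    thread = crashing_thread(crash_file)
    # crash-2022-09-07-14:45:24.log -> "09-07-2022 14:45:24" (splunkd.log prefix)
    m = re.search(r"crash-(\d{4})-(\d{2})-(\d{2})-(\d{2}:\d{2}:\d{2})\.log",
                  os.path.basename(crash_file))
    if not thread or not m:
        continue
    year, month, day, hhmmss = m.groups()
    stamp = f"{month}-{day}-{year} {hhmmss}"
    with open(os.path.join(LOG_DIR, "splunkd.log"), errors="replace") as f:
        for line in f:
            # the thread name is logged as "[<tid> IndexerTPoolWorker-4]"
            if line.startswith(stamp) and f"{thread}]" in line:
                b = re.search(r"bucket='([^']+)'", line)
                if b:
                    buckets.setdefault(b.group(1), []).append(os.path.basename(crash_file))

for bucket, crashes in buckets.items():
    print(f"{bucket} implicated in {len(crashes)} crash(es): {', '.join(crashes)}")

Run it on the affected peer; if the same bucket path keeps turning up across crashes, as '/storage/tier1/20000_idx2/db/hot_v1_432' did here, that bucket is the prime suspect.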

As the bucket was suspected to be damaged beyond what fsck can repair, we moved it out of $SPLUNK_DB, and the indexer then came up and ran like the other peers (a minimal sketch of this step follows).
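
The quarantine step could look like the sketch below, again as an illustration rather than the exact commands used in this case: it assumes splunkd is already stopped on the affected peer, and /storage/quarantine is a hypothetical writable location outside $SPLUNK_DB where the bucket can sit until it is inspected or restored.

# Minimal sketch (assumptions noted above): move the suspect bucket out of
# $SPLUNK_DB while splunkd is stopped, so it can be examined later.
import os
import shutil

SUSPECT_BUCKET = "/storage/tier1/20000_idx2/db/hot_v1_432"  # from splunkd.log above
QUARANTINE_DIR = "/storage/quarantine"                      # hypothetical path

os.makedirs(QUARANTINE_DIR, exist_ok=True)
dest = os.path.join(QUARANTINE_DIR, os.path.basename(SUSPECT_BUCKET))
shutil.move(SUSPECT_BUCKET, dest)
print(f"moved {SUSPECT_BUCKET} -> {dest}; restart splunkd and watch for further crashes")

Keeping the bucket rather than deleting it leaves the option of re-adding it later after a deeper repair.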

 

This covers the specific situation of a corrupt bucket causing the indexer to crash, and how to find the corrupt bucket(s).

