Splunk Enterprise

Why does my indexer keep crashing - IndexerTPoolWorker killed by abort (signal 6)?

sylim_splunk
Splunk Employee

The indexer rebooted non-gracefully. Since the reboot, Splunk starts generating crash files shortly after every restart. I spent the last two days running fsck repair on all buckets; it doesn't seem to have helped.
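
For reference, a minimal sketch of that bulk repair, assuming a default /opt/splunk install; confirm the exact fsck flags against your version's "splunk fsck --help" before relying on it.

# Minimal sketch (assumptions: default install path, standard fsck flags).
import subprocess

SPLUNK_BIN = "/opt/splunk/bin/splunk"  # assumed install path

# Scan and repair every bucket of every index; this cannot recover damage
# that goes beyond what the journal data still allows.
subprocess.run(
    [SPLUNK_BIN, "fsck", "repair", "--all-buckets-all-indexes"],
    check=True,
)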

No relevant errors in the splunkd.log.

Crash log files:

crash-2022-09-07-14:45:07.log
crash-2022-09-07-14:45:15.log
crash-2022-09-07-14:45:24.log
crash-2022-09-07-14:45:32.log
crash-2022-09-07-14:45:40.log

Crash log: every crash log shows the same pattern as below; only the crashing thread changes (IndexerTPoolWorker-2, IndexerTPoolWorker-4, IndexerTPoolWorker-7, and the like).

[build 87344edfcdb4] 2022-09-07 13:31:01
Received fatal signal 6 (Aborted) on PID 193171.
Cause:
Signal sent by PID 193171 running under UID 53292.
Crashing thread: IndexerTPoolWorker-2

Backtrace (PIC build):
[0x00007EFDA0A27387] gsignal + 55 (libc.so.6 + 0x36387)
[0x00007EFDA0A28A78] abort + 328 (libc.so.6 + 0x37A78)
[0x00007EFDA0A201A6] ? (libc.so.6 + 0x2F1A6)
[0x00007EFDA0A20252] ? (libc.so.6 + 0x2F252)
[0x000056097778BA2C] ReadableJournalSliceDirectory::findEventTimeRange(int*, int*, bool)

...

Libc abort message: splunkd: /opt/splunk/src/pipeline/indexer/JournalSlice.cpp:1780: bool ReadableJournalSliceDirectory::findEventTimeRange(st_time_t*, st_time_t*, bool): Assertion `tell() == pos' failed.

1 Solution

sylim_splunk
Splunk Employee

Here are the steps to find the culprit.

There are 17 crash logs in the diag file. The crashing point in every crash log file is the same, as shown below:
---- excerpts ---
Libc abort message: splunkd: /opt/splunk/src/pipeline/indexer/JournalSlice.cpp:1780: bool ReadableJournalSliceDirectory::findEventTimeRange(st_time_t*, st_time_t*, bool): Assertion `tell() == pos' failed.

------------------
This suggests corrupt or truncated bucket(s), but the crash log files give no detail about which bucket(s) are corrupt.

To find the relevant entries in splunkd.log, match the time (e.g. 14:45:24) and the thread name (e.g. IndexerTPoolWorker-4) from each crash log file. Doing so, I found the logs below; a scripted version of this matching is sketched after the list.


- In crash-2022-09-07-14:45:24.log:   
Crashing thread: IndexerTPoolWorker-4

Check splunkd.log for the entries logged at 14:45:24 by IndexerTPoolWorker-4:

splunkd.log: 09-07-2022 14:45:24.025 -0700 INFO HotBucketRoller [322700 IndexerTPoolWorker-4] - found hot bucket='/storage/tier1/20000_idx2/db/hot_v1_432'

- In crash-2022-09-07-14:45:32.log: Crashing thread: IndexerTPoolWorker-7

Check splunkd.log for the entries logged at 14:45:32 by IndexerTPoolWorker-7:


splunkd.log: 09-07-2022 14:45:32.238 -0700 INFO HotBucketRoller [322983 IndexerTPoolWorker-7] - found hot bucket='/storage/tier1/20000_idx2/db/hot_v1_432'

- In crash-2022-09-07-14:45:40.log: Crashing thread: IndexerTPoolWorker-6


splunkd.log: 09-07-2022 14:45:40.566 -0700 INFO HotBucketRoller [323220 IndexerTPoolWorker-6] - found hot bucket='/storage/tier1/20000_idx2/db/hot_v1_432'


In each case the crash log time and the thread number (IndexerTPoolWorker-#) match, and the log messages point to the same bucket: '/storage/tier1/20000_idx2/db/hot_v1_432'.
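
To automate this matching across many crash logs, here is a minimal sketch. It relies only on the details visible above (the "Crashing thread:" line, the HH:MM:SS in the crash file name, and the bucket='...' field in splunkd.log); the log directory and everything else are illustrative assumptions.

# Minimal sketch: correlate each crash log's time and crashing thread with
# splunkd.log entries and count which bucket path keeps showing up.
import glob
import os
import re
from collections import Counter

LOG_DIR = "/opt/splunk/var/log/splunk"  # assumed default log location

def crashing_thread(crash_file):
    # Return the value of the "Crashing thread:" line from one crash log.
    with open(crash_file, errors="replace") as f:
        for line in f:
            if line.startswith("Crashing thread:"):
                return line.split(":", 1)[1].strip()
    return None

def crash_second(crash_file):
    # Extract HH:MM:SS from a name like crash-2022-09-07-14:45:24.log.
    m = re.search(r"(\d{2}:\d{2}:\d{2})\.log$", os.path.basename(crash_file))
    return m.group(1) if m else None

with open(os.path.join(LOG_DIR, "splunkd.log"), errors="replace") as f:
    splunkd_lines = f.readlines()

buckets = Counter()
for crash_file in sorted(glob.glob(os.path.join(LOG_DIR, "crash-*.log"))):
    thread = crashing_thread(crash_file)
    second = crash_second(crash_file)
    if not thread or not second:
        continue
    # Keep splunkd.log lines written in the crash second by the crashing thread;
    # the trailing "]" avoids matching e.g. IndexerTPoolWorker-40 for -4.
    for line in splunkd_lines:
        if second in line and thread + "]" in line:
            print(os.path.basename(crash_file), "->", line.rstrip())
            m = re.search(r"bucket='([^']+)'", line)
            if m:
                buckets[m.group(1)] += 1

# A bucket that appears for (nearly) every crash is the prime suspect.
for path, count in buckets.most_common():
    print(count, path)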

As the bucket was suspected to be damaged beyond what fsck could repair, we moved it out of $SPLUNK_DB, and the indexer then came up and ran like the other peers.
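
A minimal sketch of that quarantine step, assuming splunkd is stopped on the peer first; the quarantine directory is illustrative, and the bucket is kept rather than deleted in case it needs further inspection.

# Minimal sketch: move the suspect bucket out of the index's db directory so
# splunkd no longer tries to open it. Stop splunkd on this peer beforehand.
import shutil
from pathlib import Path

SUSPECT = Path("/storage/tier1/20000_idx2/db/hot_v1_432")  # bucket found above
QUARANTINE = Path("/storage/quarantine")  # illustrative holding area

QUARANTINE.mkdir(parents=True, exist_ok=True)
if SUSPECT.exists():
    shutil.move(str(SUSPECT), str(QUARANTINE / SUSPECT.name))
    print("moved", SUSPECT, "->", QUARANTINE / SUSPECT.name)

After restarting splunkd, check that no new crash-*.log files appear in the log directory.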

 

This covers the specific situation of a corrupt bucket causing the indexer to crash, and how to find the corrupt bucket(s).


