Heavy forwarder or indexer crashes with a FATAL error on the typing thread.

Note: The issue is now fixed for the upcoming 9.2.2/9.1.5/9.0.10 patches.

Crashing thread: typing_0
Backtrace (PIC build):
[0x00007F192F4C2ACF] gsignal + 271 (libc.so.6 + 0x4EACF)
[0x00007F192F495EA5] abort + 295 (libc.so.6 + 0x21EA5)
[0x000055E24388D6C0] ? (splunkd + 0x1A366C0)
[0x000055E24388D770] ? (splunkd + 0x1A36770)
[0x000055E2445D6D24] PipelineInputChannelReference::PipelineInputChannelReference(Str const**, PipelineInputChannelSet*, bool) + 388 (splunkd + 0x277FD24)
[0x000055E2445BACC3] PipelineData::set_channel(Str const*, Str const*, Str const*) + 243 (splunkd + 0x2763CC3)
[0x000055E2445BAF9E] PipelineData::recomputeConfKey(PipelineSet*, bool) + 286 (splunkd + 0x2763F9E)
[0x000055E243E3689E] RegexExtractionProcessor::each(CowPipelineData&, PipelineDataVector*, bool) + 718 (splunkd + 0x1FDF89E)
[0x000055E243E36BF3] RegexExtractionProcessor::executeMulti(PipelineDataVector&, PipelineDataVector*) + 67 (splunkd + 0x1FDFBF3)
[0x000055E243BCD5F2] Pipeline::main() + 1074 (splunkd + 0x1D765F2)
[0x000055E244C336FD] Thread::_callMainAndDiscardTerminateException() + 13 (splunkd + 0x2DDC6FD)
[0x000055E244C345F2] Thread::callMain(void*) + 178 (splunkd + 0x2DDD5F2)
[0x00007F192FF1F1CA] ? (libpthread.so.0 + 0x81CA)
[0x00007F192F4ADE73] clone + 67 (libc.so.6 + 0x39E73)

Crashing thread: typing_0
Backtrace (PIC build):
[0x00007F192F4C2ACF] gsignal + 271 (libc.so.6 + 0x4EACF)
[0x00007F192F495EA5] abort + 295 (libc.so.6 + 0x21EA5)
[0x000055E24388D6C0] ? (splunkd + 0x1A366C0)
[0x000055E24388D770] ? (splunkd + 0x1A36770)
[0x000055E2445D6D24] _ZN29PipelineInputChannelReferenceC2EPPK3StrP23PipelineInputChannelSetb + 388 (splunkd + 0x277FD24)
[0x000055E2445BACC3] _ZN12PipelineData11set_channelEPK3StrS2_S2_ + 243 (splunkd + 0x2763CC3)
[0x000055E2445BAF9E] _ZN12PipelineData16recomputeConfKeyEP11PipelineSetb + 286 (splunkd + 0x2763F9E)
[0x000055E243E3689E] _ZN24RegexExtractionProcessor4eachER15CowPipelineDataP18PipelineDataVectorb + 718 (splunkd + 0x1FDF89E)
[0x000055E243E36BF3] _ZN24RegexExtractionProcessor12executeMultiER18PipelineDataVectorPS0_ + 67 (splunkd + 0x1FDFBF3)
[0x000055E243BCD5F2] _ZN8Pipeline4mainEv + 1074 (splunkd + 0x1D765F2)
[0x000055E244C336FD] _ZN6Thread37_callMainAndDiscardTerminateExceptionEv + 13 (splunkd + 0x2DDC6FD)
[0x000055E244C345F2] _ZN6Thread8callMainEPv + 178 (splunkd + 0x2DDD5F2)
[0x00007F192FF1F1CA] ? (libpthread.so.0 + 0x81CA)
[0x00007F192F4ADE73] clone + 67 (libc.so.6 + 0x39E73)
Last few lines of stderr (may contain info on assertion failure, but also could be old):
Fatal thread error: pthread_mutex_lock: Invalid argument; 117 threads active, in typing_0
Fatal thread error: pthread_mutex_lock: Invalid argument;

This crash happens when a persistent queue is enabled. It has been reported for several years; I see one report from as far back as 2015: https://community.splunk.com/t5/Monitoring-Splunk/What-would-cause-a-Fatal-thread-error-in-thread-typing-found-in/m-p/261407

The bug has always existed, but the interesting part is that since 9.x the crash frequency has gone up and more customers are reporting it; the probability of hitting the race condition is now higher. We are fixing the issue (internal ticket SPL-251434) for the next patch. In the meantime, here are a few workarounds to consider, depending on what is feasible for your environment.

The reason 9.x instances with a persistent queue enabled crash more often is that forwarders (UF/HF/IUF/IHF) send data at a faster rate due to 9.x autoBatch, so the small in-memory part of the persistent queue (default 500KB) makes it nearly impossible to keep the on-disk part of the persistent queue out of play. In other words, a 9.x receiver with a persistent queue is writing to disk nearly all the time, even when the downstream pipeline queues are not saturated. The best way to bring the crash frequency back down to the level of 8.x or older is therefore to increase the in-memory part of the persistent queue, so that nothing is written to disk while the downstream queues are not full. The fundamental bug still remains and will be fixed in a patch; the workarounds below only reduce the likelihood of disk writes to the persistent queue.

Have a look at the following workarounds and see which one works for you:

1. Turn off the persistent queue on the splunktcp input port (I am sure this is not feasible for everyone). This eliminates the crash. (See the inputs.conf sketch after this list.)

2. Disable the `splunk_internal_metrics` app, as it does sourcetype cloning for metrics.log. Most of us are probably not aware that metrics.log is cloned and additionally indexed into the `_metrics` index. If you are not using the `_metrics` index, disable the app. For the crash to happen, you need two conditions: a) a persistent queue, and b) sourcetype cloning. (See the app.conf sketch after this list.)

3. Apply the following settings to reduce the chances of a crash.

   limits.conf:

   [input_channels]
   max_inactive = 300001
   lowater_inactive = 300000
   inactive_eligibility_age_seconds = 120

   inputs.conf (increase the in-memory queue size of the persistent queue; use the SSL or non-SSL stanza depending on your receiving port):

   [splunktcp-ssl:<port>]
   queueSize = 100MB

   [splunktcp:<port>]
   queueSize = 100MB

   Also enable Async Forwarding on the HF/IUF/IHF (the crashing instance).

4. Slow down the forwarders by setting `autoBatch = false` on all universal forwarders/heavy forwarders. (See the outputs.conf sketch after this list.)
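For workaround 1, the persistent queue on a splunktcp input is controlled by the `persistentQueueSize` setting in inputs.conf on the receiving instance; removing or commenting out that setting turns the persistent queue off for that port. A minimal sketch, assuming a non-SSL receiving port of 9997 (the port number and the queue size shown are placeholders; adjust to your own stanza):

   [splunktcp:9997]
   # The persistent queue was enabled by persistentQueueSize; commenting it out
   # (or deleting the line) disables the on-disk persistent queue for this input.
   # persistentQueueSize = 10GB

As with any inputs.conf change on the receiver, restart the instance for the change to take effect.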
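For workaround 2, one way to disable the app is a local app.conf override on the crashing instance (the CLI command `$SPLUNK_HOME/bin/splunk disable app splunk_internal_metrics` achieves the same thing); this is only a sketch, and a restart is needed for it to take effect:

   # $SPLUNK_HOME/etc/apps/splunk_internal_metrics/local/app.conf
   [install]
   # Disables the app, which stops the sourcetype cloning of metrics.log
   # into the _metrics index.
   state = disabled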
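For workaround 4, `autoBatch` is an outputs.conf setting on the sending forwarders. A minimal sketch, assuming it is applied at the default [tcpout] level (it can also be set on a specific tcpout group stanza):

   # outputs.conf on every UF/HF that sends to the crashing receiver
   [tcpout]
   # Disable 9.x auto-batching so the forwarder sends at the older, slower rate,
   # making the receiver's in-memory queue less likely to spill into the
   # on-disk persistent queue.
   autoBatch = false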