I'm noticing that our indexers are crashing and not coming back gracefully. I've looked in the logs and keep seeing segfault errors. It really puts extra strain on the system when 3-4 indexers go down all at once. I'm thinking it has something to do with the time, but I'm not sure yet.
Cause:
Unknown signal origin (si_code=128, si_addr=[0x0000000000000000]).
Crashing thread: indexerPipe_1
Registers:
RIP: [0x000055923E1027F0] _ZN14IndexProcessor18rollAllHotForIndexERK3StriS2_RKSt13unordered_mapIS0_6ObjRefI11IndexWriterE8hash_str6eq_strSaISt4pairIS1_S6_EEE + 592 (splunkd + 0xA6C7F0)
RDI: [0xFFFFFFFFFFFFFFF7]
RSI: [0x0000000000000004]
RBP: [0x0000000000000001]
RSP: [0x00007F47507FEA30]
RAX: [0x0000000000000000]
RBX: [0x0E00000001000000]
RCX: [0x0000000000000000]
RDX: [0x0000000000000400]
R8: [0x000055923F4E6449]
R9: [0x00007F475AC9D130]
R10: [0x00007F476C9B1D50]
R11: [0x00007F476B000080]
R12: [0x00007F472B7D0670]
R13: [0x00007F472B7D05D0]
R14: [0x0000000000000000]
R15: [0x00007F472B7D06C0]
EFL: [0x0000000000010246]
TRAPNO: [0x000000000000000D]
ERR: [0x0000000000000000]
CSGSFS: [0x0000000000000033]
OLDMASK: [0x0000000000000000]
OS: Linux
Arch: x86-64
using CLOCK_MONOTONIC
Thread: "indexerPipe_1", did_join=0, ready_to_run=Y, main_thread=N
First 8 bytes of Thread token @0x7f475082a010:
00000000 00 f7 7f 50 47 7f 00 00 |...PG...|
00000008
x86 CPUID registers:
0: 0000000F 756E6547 6C65746E 49656E69
1: 000306F2 00100800 7FFEFBFF BFEBFBFF
2: 76036301 00F0B5FF 00000000 00C10000
3: 00000000 00000000 00000000 00000000
4: 00000000 00000000 00000000 00000000
5: 00000040 00000040 00000003 00002120
6: 00000077 00000002 00000009 00000000
7: 00000000 00000000 00000000 00000000
8: 00000000 00000000 00000000 00000000
9: 00000001 00000000 00000000 00000000
A: 07300403 00000000 00000000 00000603
B: 00000000 00000000 000000AD 00000000
C: 00000000 00000000 00000000 00000000
D: 00000000 00000000 00000000 00000000
E: 00000000 00000000 00000000 00000000
F: 00000000 00000000 00000000 00000000
80000000: 80000008 00000000 00000000 00000000
80000001: 00000000 00000000 00000021 2C100800
80000002: 65746E49 2952286C 6F655820 2952286E
80000003: 55504320 2D354520 30343632 20337620
80000004: 2E322040 48473036 0000007A 00000000
80000005: 00000000 00000000 00000000 00000000
80000006: 00000000 00000000 01006040 00000000
80000007: 00000000 00000000 00000000 00000100
80000008: 0000302E 00000000 00000000 00000000
terminating...
I've seen this in a couple of environments, so I don't think it's a unique problem.
If it's not a bug then it must be a bug
Hi,
Did you perform any kind of update lately (Splunk Enterprise, NFS, OS)?
I've seen behavior like this in all 3 cases. It may help to identify when this started.
How can a customer review the Bug information? SPL-148969
Submit a support case, they can tell you.
We had exactly the same problem, and it started after we upgraded to version 7.0.2.
After we opened a case with Splunk, they instructed us to upgrade to version 7.0.3 or higher, because it's a bug that was fixed in "SPL-148969, SPL-148600 Indexer may crash during hot bucket rolling following a streaming failure".
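If it helps anyone triaging the same thing across several indexers, here is a rough Python sketch for checking whether a crash log shows the same crashing frame as the one posted above. The mangled name in the RIP line corresponds to IndexProcessor::rollAllHotForIndex, which fits the "hot bucket rolling" description. The function names here are my own, the parsing assumes the log layout shown in this thread, and a matching symbol is only a hint, not confirmation of SPL-148969; Support would still need to confirm.

```python
import re

def summarize_crash(log_text):
    """Return (crashing_thread, rip_symbol) from a splunkd crash log.

    Assumes the layout shown in this thread: a "Crashing thread:" line
    and a "RIP: [0x...] <symbol> + <offset>" line. Either value is None
    if its line is not found.
    """
    thread = re.search(r"^[ \t]*Crashing thread:[ \t]*(\S+)", log_text, re.MULTILINE)
    rip = re.search(r"^[ \t]*RIP:[ \t]*\[0x[0-9A-Fa-f]+\][ \t]*(\S+)", log_text, re.MULTILINE)
    return (
        thread.group(1) if thread else None,
        rip.group(1) if rip else None,
    )

def looks_like_hot_bucket_roll_crash(log_text):
    """True if the crashing frame is in rollAllHotForIndex (a hint only)."""
    _thread, symbol = summarize_crash(log_text)
    return symbol is not None and "rollAllHotForIndex" in symbol
```

You would read each crash log into a string and pass it in. On our systems the crash logs sit under $SPLUNK_HOME/var/log/splunk as crash-*.log, but check where yours are written.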
Hope this helps you.