
Why do my Indexers, on Linux, keep segfaulting randomly?

Explorer

I'm noticing that our indexers are crashing and not coming back gracefully. I've looked in the logs and keep seeing segfault errors. It puts real extra strain on the system when 3-4 indexers go down at once. I suspect it has something to do with the time, but I'm not sure yet.

Cause:

Unknown signal origin (si_code=128, si_addr=[0x0000000000000000]).
 Crashing thread: indexerPipe_1
 Registers:
    RIP:  [0x000055923E1027F0] _ZN14IndexProcessor18rollAllHotForIndexERK3StriS2_RKSt13unordered_mapIS0_6ObjRefI11IndexWriterE8hash_str6eq_strSaISt4pairIS1_S6_EEE + 592 (splunkd + 0xA6C7F0)
    RDI:  [0xFFFFFFFFFFFFFFF7]
    RSI:  [0x0000000000000004]
    RBP:  [0x0000000000000001]
    RSP:  [0x00007F47507FEA30]
    RAX:  [0x0000000000000000]
    RBX:  [0x0E00000001000000]
    RCX:  [0x0000000000000000]
    RDX:  [0x0000000000000400]
    R8:  [0x000055923F4E6449]
    R9:  [0x00007F475AC9D130]
    R10:  [0x00007F476C9B1D50]
    R11:  [0x00007F476B000080]
    R12:  [0x00007F472B7D0670]
    R13:  [0x00007F472B7D05D0]
    R14:  [0x0000000000000000]
    R15:  [0x00007F472B7D06C0]
    EFL:  [0x0000000000010246]
    TRAPNO:  [0x000000000000000D]
    ERR:  [0x0000000000000000]
    CSGSFS:  [0x0000000000000033]
    OLDMASK:  [0x0000000000000000]

 OS: Linux
 Arch: x86-64

using CLOCK_MONOTONIC
Thread: "indexerPipe_1", did_join=0, ready_to_run=Y, main_thread=N
First 8 bytes of Thread token @0x7f475082a010:
00000000  00 f7 7f 50 47 7f 00 00                           |...PG...|
00000008


x86 CPUID registers:
         0: 0000000F 756E6547 6C65746E 49656E69
         1: 000306F2 00100800 7FFEFBFF BFEBFBFF
         2: 76036301 00F0B5FF 00000000 00C10000
         3: 00000000 00000000 00000000 00000000
         4: 00000000 00000000 00000000 00000000
         5: 00000040 00000040 00000003 00002120
         6: 00000077 00000002 00000009 00000000
         7: 00000000 00000000 00000000 00000000
         8: 00000000 00000000 00000000 00000000
         9: 00000001 00000000 00000000 00000000
         A: 07300403 00000000 00000000 00000603
         B: 00000000 00000000 000000AD 00000000
         C: 00000000 00000000 00000000 00000000
         D: 00000000 00000000 00000000 00000000
         E: 00000000 00000000 00000000 00000000
         F: 00000000 00000000 00000000 00000000
  80000000: 80000008 00000000 00000000 00000000
  80000001: 00000000 00000000 00000021 2C100800
  80000002: 65746E49 2952286C 6F655820 2952286E
  80000003: 55504320 2D354520 30343632 20337620
  80000004: 2E322040 48473036 0000007A 00000000
  80000005: 00000000 00000000 00000000 00000000
  80000006: 00000000 00000000 01006040 00000000
  80000007: 00000000 00000000 00000000 00000100
  80000008: 0000302E 00000000 00000000 00000000
terminating...

I've seen this in a couple of environments, so I don't think it's a unique problem.
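For anyone comparing crash dumps across environments, the CPUID block can be decoded to confirm which CPU each indexer runs on. This is only a rough sketch: it assumes the four columns in the dump are EAX, EBX, ECX and EDX in that order, and the helper name `decode_regs` is made up for illustration.

```python
import struct

def decode_regs(words):
    """Pack 32-bit hex register values little-endian and read them as ASCII."""
    raw = b"".join(struct.pack("<I", int(w, 16)) for w in words)
    return raw.rstrip(b"\x00").decode("ascii")

# Leaf 0 from the dump. Assumption: columns are EAX EBX ECX EDX.
# The vendor string is stored in the order EBX, EDX, ECX.
leaf0 = "0000000F 756E6547 6C65746E 49656E69".split()
vendor = decode_regs([leaf0[1], leaf0[3], leaf0[2]])

# Leaves 80000002..80000004 hold the processor brand string,
# read as EAX, EBX, ECX, EDX for each leaf in turn.
brand_leaves = [
    "65746E49 2952286C 6F655820 2952286E",
    "55504320 2D354520 30343632 20337620",
    "2E322040 48473036 0000007A 00000000",
]
brand = decode_regs([w for line in brand_leaves for w in line.split()])

print(vendor)  # GenuineIntel
print(brand)   # Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
```

Separately, TRAPNO 0x0D in the register list is the x86 general protection fault vector, which fits the "unknown signal origin" line rather than an ordinary bad-pointer page fault.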


SplunkTrust

If it's not a bug then it must be a bug


Influencer

Hi,

Did you perform any kind of update lately: Splunk Enterprise, NFS, or the OS?

I have seen behavior like this in all three cases. Knowing when it started may help identify the cause.


New Member

How can a customer review the bug information for SPL-148969?


SplunkTrust

Submit a support case; Splunk Support can tell you.


New Member

We had exactly the same problem, and it started after upgrading to version 7.0.2.
After we opened a case with Splunk Support, they instructed us to upgrade to version 7.0.3 or higher, because this is a bug that was fixed under "SPL-148969, SPL-148600 Indexer may crash during hot bucket rolling following a streaming failure".

Hope this helps you.
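If you have many indexers to check, something like the following could flag which ones are still below the fixed release. It is a minimal sketch based only on the "7.0.3 or higher" guidance above; the function names and the sample host list are hypothetical, and it says nothing about which older releases are actually affected.

```python
# Version named in the support guidance above as containing the fix.
FIXED_IN = (7, 0, 3)

def parse_version(text):
    """Turn a dotted version string such as '7.0.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))

def has_fix(version):
    """True if the version is at or above the release that contains the fix."""
    return parse_version(version) >= FIXED_IN

# Hypothetical inventory: host name -> Splunk version reported by that host.
indexers = {"idx01": "7.0.2", "idx02": "7.0.3", "idx03": "7.1.0"}

for host, version in sorted(indexers.items()):
    status = "ok" if has_fix(version) else "needs upgrade"
    print(f"{host}: {version} ({status})")
```

The version strings would have to come from your own inventory; the sketch only does the comparison.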
