
I have 3 indexers, why does the Master Node crash after one of them restarts?

emzet
Explorer

Hello, 

I have 3 indexers. After one of them was restarted, the Master Node crashed, and it now writes a new crash log every minute (each time the indexer tries to connect to the cluster).

Crash log below:

 

[build cd0848707637] 2022-03-29 17:48:34
Received fatal signal 6 (Aborted) on PID 3183981.
 Cause:
   Signal sent by PID 3183981 running under UID 1004.
 Crashing thread: CMAddPeerWorker-5
 Registers:
    RIP:  [0x00007FDB3792137F] gsignal + 271 (libc.so.6 + 0x3737F)
    RDI:  [0x0000000000000002]
    RSI:  [0x00007FDB121F9860]
    RBP:  [0x00007FDB37A74698]
    RSP:  [0x00007FDB121F9860]
    RAX:  [0x0000000000000000]
    RBX:  [0x0000000000000006]
    RCX:  [0x00007FDB3792137F]
    RDX:  [0x0000000000000000]
    R8:  [0x0000000000000000]
    R9:  [0x00007FDB121F9860]
    R10:  [0x0000000000000008]
    R11:  [0x0000000000000246]
    R12:  [0x0000555F4AA9B818]
    R13:  [0x0000555F4A93BC02]
    R14:  [0x00000000000003C2]
    R15:  [0x00007FDB16506238]
    EFL:  [0x0000000000000246]
    TRAPNO:  [0x0000000000000000]
    ERR:  [0x0000000000000000]
    CSGSFS:  [0x002B000000000033]
    OLDMASK:  [0x0000000000000000]

 OS: Linux
 Arch: x86-64

 Backtrace (PIC build):
  [0x00007FDB3792137F] gsignal + 271 (libc.so.6 + 0x3737F)
  [0x00007FDB3790BDB5] abort + 295 (libc.so.6 + 0x21DB5)
  [0x00007FDB3790BC89] ? (libc.so.6 + 0x21C89)
  [0x00007FDB37919A76] ? (libc.so.6 + 0x2FA76)
  [0x0000555F497B294F] _ZN8CMBucket14setRASummariesERK4GuidRKSt3mapI3Str15CMBucketSummarySt4lessIS4_ESaISt4pairIKS4_S5_EEE + 623 (splunkd + 0x28C694F)
  [0x0000555F496C13C8] _ZN15CMAddPeerWorker15finishAddBucketERP8CMBucketR15BucketCSVStruct + 136 (splunkd + 0x27D53C8)
  [0x0000555F496C2320] _ZN15CMAddPeerWorker19addStandaloneBucketERK13IndexDataTypeR15BucketCSVStruct + 128 (splunkd + 0x27D6320)
  [0x0000555F496C24B3] _ZN15CMAddPeerWorker20processBucketBatchesEv + 291 (splunkd + 0x27D64B3)
  [0x0000555F48757588] _ZN15CMAddPeerWorker4mainEv + 552 (splunkd + 0x186B588)
  [0x0000555F4959B917] _ZN6Thread8callMainEPv + 135 (splunkd + 0x26AF917)
  [0x00007FDB37CB717A] ? (libpthread.so.0 + 0x817A)
  [0x00007FDB379E6DC3] clone + 67 (libc.so.6 + 0xFCDC3)
 Linux / splunk-master-prod-01.local.ad / 4.18.0-240.1.1.el8_3.x86_64 / #1 SMP Fri Oct 16 13:36:46 EDT 2020 / x86_64
 Libc abort message: splunkd: /opt/splunk/src/clustering/CMBucket.cpp:962: void CMBucket::setRASummaries(const Guid&, const CMBucketSummaries&): Assertion `hasPeer(peer)' failed.

 /etc/redhat-release: Red Hat Enterprise Linux release 8.5 (Ootpa)
 glibc version: 2.28
 glibc release: stable
Last errno: 0
Threads running: 103
Runtime: 56.398836s
argv: [splunkd --under-systemd --systemd-delegate=yes -p 8089 _internal_launch_under_systemd]
Regex JIT enabled

RE2 regex engine enabled

using CLOCK_MONOTONIC
Thread: "CMAddPeerWorker-5", did_join=0, ready_to_run=Y, main_thread=N, token=140578878629632
MutexByte: MutexByte-waiting={none}


x86 CPUID registers:
         0: 0000000D 756E6547 6C65746E 49656E69
         1: 000306F0 07040800 FFFA3203 1F8BFBFF
         2: 76036301 00F0B5FF 00000000 00C30000
         3: 00000000 00000000 00000000 00000000
         4: 00000000 00000000 00000000 00000000
         5: 00000000 00000000 00000000 00000000
         6: 00000004 00000000 00000000 00000000
         7: 00000000 00000000 00000000 00000000
         8: 00000000 00000000 00000000 00000000
         9: 00000000 00000000 00000000 00000000
         A: 07300401 000000FF 00000000 00000000
         B: 00000000 00000000 00000047 00000007
         C: 00000000 00000000 00000000 00000000
          00000000 00000000 00000000 00000000
  80000000: 80000008 00000000 00000000 00000000
  80000001: 00000000 00000000 00000021 2C100800
  80000002: 65746E49 2952286C 6F655820 2952286E
  80000003: 55504320 2D354520 30383632 20347620
  80000004: 2E322040 48473034 0000007A 00000000
  80000005: 00000000 00000000 00000000 00000000
  80000006: 00000000 00000000 01006040 00000000
  80000007: 00000000 00000000 00000000 00000100
  80000008: 0000302B 00000000 00000000 00000000
terminating...

 

Indexer-1 (the one that was rebooted) also cannot join the cluster.

Has anyone had this problem, and how did you deal with it?

If more info is needed, I can send it.

TuanLDA
Observer

Can you give more details?


spelunkingsplnk
Splunk Employee

Did you ever figure out this issue? I'm experiencing a very similar one: I have 3 indexers, restarted one of them, and the Master Node crashed. I even got the same error message as you:

splunkd: /opt/splunk/src/clustering/CMBucket.cpp:962: void CMBucket::setRASummaries(const Guid&, const CMBucketSummaries&): Assertion 'hasPeer(peer)' failed.

emzet
Explorer

The problem turned out to be on one of the indexers: it held buckets whose directory names did not have the GUID suffix.
After adding the suffix to those buckets, the cluster started working normally.
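
For anyone hitting the same assertion, here is a minimal sketch for finding such buckets. It assumes the usual on-disk naming, where a standalone bucket directory is named db_<newest>_<oldest>_<id> and a clustered one carries an extra _<GUID> at the end (rb_ for replicated copies). The function names and the regex are my own, not anything shipped with Splunk, and the script only reports names; it does not rename anything.

```python
import re
from pathlib import Path

# Assumed naming convention (not taken from the crash log itself):
#   standalone: db_<newest_time>_<oldest_time>_<local_id>
#   clustered:  db_<newest_time>_<oldest_time>_<local_id>_<GUID>  (rb_ for replicas)
BUCKET_RE = re.compile(
    r"^(?:db|rb)_\d+_\d+_\d+"
    r"(?:_(?P<guid>[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}))?$"
)


def is_missing_guid(dir_name: str) -> bool:
    """True if the name looks like a warm/cold bucket but has no GUID suffix.

    Names that do not match the bucket pattern at all (hot buckets,
    temporary directories, and so on) are ignored and return False.
    """
    match = BUCKET_RE.match(dir_name)
    return match is not None and match.group("guid") is None


def find_buckets_missing_guid(index_db_dir: str) -> list[str]:
    """Return sorted names of bucket directories under index_db_dir lacking a GUID."""
    return sorted(
        entry.name
        for entry in Path(index_db_dir).iterdir()
        if entry.is_dir() and is_missing_guid(entry.name)
    )
```

Call find_buckets_missing_guid() with the db (or colddb) directory of each index on the affected indexer and it returns the bucket names to review. I would run it with the indexer stopped and take a backup before renaming any directory.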
