Our search head cluster environment is crashing at the start of every hour. Nodes go down without any notable error in splunkd.log. The crash logs provide the following information:
[build aa7d4b1ccb80] 2016-03-02 11:01:04
Received fatal signal 6 (Aborted).
Cause:
Signal sent by PID 2461 running under UID 0.
Crashing thread: tailreader0
Registers:
RIP: [0x00007FDDBC1625F7] gsignal + 55 (/lib64/libc.so.6)
RDI: [0x000000000000099D]
RSI: [0x0000000000000CC6]
RBP: [0x00007FDDBC2ABBE8]
RSP: [0x00007FDD9ABFE8C8]
RAX: [0x0000000000000000]
RBX: [0x00007FDDBD6B1000]
RCX: [0x00007FDDBC1625F7]
RDX: [0x0000000000000006]
R8: [0xFEFEFEFEFEFEFEFF]
R9: [0x00007FDDBD6F9F60]
R10: [0x0000000000000008]
R11: [0x0000000000000202]
R12: [0x0000000001813019]
R13: [0x000000000187EFE0]
R14: [0x00007FDD9AC7EC28]
R15: [0x00007FDD9ABFEE68]
EFL: [0x0000000000000202]
TRAPNO: [0x0000000000000000]
ERR: [0x0000000000000000]
CSGSFS: [0x0000000000000033]
OLDMASK: [0x0000000000000000]
OS: Linux
Arch: x86-64
Backtrace:
[0x00007FDDBC1625F7] gsignal + 55 (/lib64/libc.so.6)
[0x00007FDDBC163CE8] abort + 328 (/lib64/libc.so.6)
[0x00007FDDBC15B566] ? (/lib64/libc.so.6)
[0x00007FDDBC15B612] ? (/lib64/libc.so.6)
[0x0000000000A503EA] ? (splunkd)
[0x0000000000A4E6C3] _ZNK11TailWatcher12setupConfigsER15WatchedTailFile + 1507 (splunkd)
[0x0000000000A4E7D2] _ZNK11TailWatcher19initializeFileStateER15WatchedTailFileRK8Pathname + 66 (splunkd)
[0x0000000000A679F5] _ZN10TailReader10handleFileEP15WatchedTailFileP11TailWatcher + 69 (splunkd)
[0x0000000000A6A2DA] _ZN12ReaderThread4mainEv + 378 (splunkd)
[0x000000000109F0EE] _ZN6Thread8callMainEPv + 62 (splunkd)
[0x00007FDDBC4F6DC5] ? (/lib64/libpthread.so.0)
[0x00007FDDBC223BDD] clone + 109 (/lib64/libc.so.6)
Linux / ip-10-37-20-183 / 4.1.10-17.31.amzn1.x86_64 / #1 SMP Sat Oct 24 01:31:37 UTC 2015 / x86_64
Last few lines of stderr (may contain info on assertion failure, but also could be old):
File 'etc/apps/splunk_management_console/default/transforms.conf' changed or missing.
File 'etc/apps/user-prefs/default/app.conf' changed or missing.
File 'etc/apps/user-prefs/default/user-prefs.conf' changed or missing.
File 'etc/system/default/authorize.conf' changed or missing.
File 'etc/system/default/limits.conf' changed or missing.
2016-03-01 19:09:30.757 +0000 splunkd started (build aa7d4b1ccb80)
splunkd: /home/build/build-src/ember/src/pipeline/input/Tailing.h:178: bool StatWrap::isDir() const: Assertion `_valid' failed.
2016-03-02 06:04:02.168 +0000 splunkd started (build aa7d4b1ccb80)
splunkd: /home/build/build-src/ember/src/pipeline/input/Tailing.h:178: bool StatWrap::isDir() const: Assertion `_valid' failed.
glibc version: 2.17
glibc release: stable
Last errno: 2
Threads running: 85
Runtime: 17821.939688s
argv: [splunkd -p 8089 start]
Thread: "tailreader0", did_join=0, ready_to_run=Y, main_thread=N
First 8 bytes of Thread token @0x7fdda64b6610:
00000000 00 f7 bf 9a dd 7f 00 00 |........|
00000008
ReaderThread: mode=0, queueSize=3, shutdown=N, reconfigure=N, mode=0
Reading File-WatchedTailFile-WatchedFileState: path="/etc/passwd.20512", flags=0x24233
First 144 bytes of PathnameStat @0x7fdd8f18c408:
00000000 01 ca 00 00 00 00 00 00 95 08 04 00 00 00 00 00 |................|
00000010 01 00 00 00 00 00 00 00 80 81 00 00 00 00 00 00 |................|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 06 00 00 00 00 00 00 00 00 10 00 00 00 00 00 00 |................|
00000040 08 00 00 00 00 00 00 00 ef c7 d6 56 00 00 00 00 |...........V....|
00000050 9f af 83 05 00 00 00 00 ef c7 d6 56 00 00 00 00 |...........V....|
00000060 9f af 83 05 00 00 00 00 ef c7 d6 56 00 00 00 00 |...........V....|
00000070 9f af 83 05 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000090
FilesystemChangeWatcher: _timeoutActive=N, _throttled=N, _waitingForNotifyCount=4
EMPTY Q: waitingForTimeout=N, noAction=N, stat=Y, immediateStat=Y, readdir=Y, notify=Y
Timeout: _when = 8017369883810230355.7291118079607532909, _initialMsec = 8242823450149347693
file-in: _initialized=Y, _lastCharWasNewline=N, _lastReadHadNulls=N, _wasCrcConflict=N, _warned=N
_nullsWarned=N, _wasTooNew=N, _exists=N, _noDebug=N
_hadExplicitSource=N, _crossedInitCrcLenBoundary=N, _classifiedAtLeastOnce=N, _fileReplaced=N, _readPathAfterRealEOF=N
_onlyNotifiedOnce=Y, _isArchive=N, _isCached=111213, _unowned=N, _deleteOnEOF=N
_overrideDeleteOnEOF=N, _doNotDeleteChildren=N, _readFromEnd=N, _readIrregardless=N
_fileCheckMethod=0, _crcSalt=
_bytesRead=0, _storingBytesRead=0, _initCrc=0x0, _seekCrc=0x0
_filenameCrc=0xb6f41d1c1b5da0d8, _fallbackCrc=0x0, _lastEOFTime=
_eofSeconds=3, _ignoreThresh=
_pendingMetadata=
_prevFd=-1, _pdModels=[0 PDs]
_rescheduleDelay=1000, _rescheduleTarget=
_st=[dev=51713, ino=264341, mode=100600, size=6, mtime=1456916463, owner=0, group=0]
_toStringPrefix=state=0x0x7fdd8f18c380, _backoff=0
_stdataInputHeaderProcessing=[]
_detectTrailingNulls=N, _detectReadingFromOffSet=N, _readAndSkipHeader=N, _uniqueId=0
_rawPath=
x86 CPUID registers:
0: 0000000D 756E6547 6C65746E 49656E69
1: 000306E4 07080800 FFBA2203 178BFBFF
2: 76036301 00F0B2FF 00000000 00CA0000
3: 00000000 00000000 00000000 00000000
4: 00000000 00000000 00000000 00000000
5: 00000000 00000000 00000000 00000000
6: 00000000 00000000 00000000 00000000
7: 00000000 00000000 00000000 00000000
8: 00000000 00000000 00000000 00000000
9: 00000000 00000000 00000000 00000000
A: 00000000 00000000 00000000 00000000
B: 00000000 00000000 00000000 00000007
C: 00000000 00000000 00000000 00000000
D: 00000000 00000000 00000000 00000000
80000000: 80000008 00000000 00000000 00000000
80000001: 00000000 00000000 00000001 28100800
80000002: 20202020 6E492020 286C6574 58202952
80000003: 286E6F65 43202952 45205550 36322D35
80000004: 76203038 20402032 30382E32 007A4847
80000005: 00000000 00000000 00000000 00000000
80000006: 00000000 00000000 01006040 00000000
80000007: 00000000 00000000 00000000 00000000
80000008: 0000302E 00000000 00000000 00000000
terminating...
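A few clues in the dump fit together: the assertion `StatWrap::isDir() const: Assertion `_valid' failed`, `Last errno: 2` (ENOENT, "No such file or directory"), and the file being read, `path="/etc/passwd.20512"`. A name like `passwd.<number>` looks like the transient lock/temp file that shadow-utils creates next to `/etc/passwd` while accounts are modified, and such files vanish almost immediately. Our guess (an assumption, since Splunk's source is not public) is that the tailing processor stats the file after it has already been removed and then uses the invalid stat result, tripping the assertion. A minimal Python sketch of that pattern, with all names hypothetical:

```python
import os
import stat

class StatWrapSketch:
    """Toy stand-in for the StatWrap in the crash log (hypothetical, for illustration)."""

    def __init__(self, path):
        try:
            self._st = os.stat(path)   # can fail with errno 2 (ENOENT) if the file vanished
        except FileNotFoundError:
            self._st = None            # the '_valid' flag would be false here

    def is_dir_unchecked(self):
        # Mirrors the asserting behavior seen in the crash: using the result
        # of a failed stat() aborts the process instead of degrading gracefully.
        assert self._st is not None, "_valid' failed"
        return stat.S_ISDIR(self._st.st_mode)

    def is_dir_safe(self):
        # A race-tolerant variant: a vanished file is simply "not a directory".
        return self._st is not None and stat.S_ISDIR(self._st.st_mode)
```

The point of the sketch: a monitored path that disappears between directory scan and `stat()` is a normal race on a live system, so the check needs to tolerate it rather than assert.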
One pattern I found: the crash logs are always generated at a fixed time each hour, hh:01:03 (visible in the crash-log filenames).
I have gone through the list of alerts triggered at the start of the hour, but couldn't find the cause.
We are running Splunk 6.3.0 on a machine with 8 CPUs and 16 GB RAM.
dmesg shows the following output:
[5641614.805910] splunkd[5823]: segfault at 10 ip 00000000014d7a91 sp 00007f5987bfc860 error 4 in splunkd[400000+1ade000]
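That dmesg line can be decoded: `segfault at 10` means the faulting access was at address 0x10, i.e. a dereference through a near-NULL pointer, and `error 4` is the x86 page-fault error code. The bit meanings below come from the x86 architecture, not from Splunk; a small decoder:

```python
def decode_pf_error(code):
    """Decode the low bits of an x86 page-fault error code, as reported in
    kernel 'segfault at ADDR ... error N' messages."""
    return {
        "page_present": bool(code & 1),       # False: fault on a non-present page
        "write": bool(code & 2),              # False: the access was a read
        "user_mode": bool(code & 4),          # True: fault came from user space
        "instruction_fetch": bool(code & 16), # True: fault on an instruction fetch
    }

# error 4 => a user-mode read of an unmapped page: a classic NULL(+0x10) dereference
print(decode_pf_error(4))
```

So the segfault is consistent with the crash-log picture: user-space splunkd reading through an invalid pointer.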
From the docs, we found that this issue was supposed to be fixed in 6.2.3, but we hit it even on 6.3.0.
We tried upgrading from 6.3.0 to 6.3.3, and that worked: we have not seen the crash for the past 20 days.
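For anyone who cannot upgrade immediately: if our reading of the crash is right (that tailing races with transient `/etc/passwd.<pid>` lock files), a possible mitigation is to blacklist those files from any monitor input that covers `/etc`. This is our own workaround idea, not an official fix; the stanza below is an example and the path must match your actual input:

```ini
# Hypothetical example: only relevant if you have a monitor input over /etc.
# Excludes transient shadow-utils lock/temp files such as /etc/passwd.20512,
# which appear and vanish while user accounts are being modified.
[monitor:///etc]
blacklist = (passwd|shadow|group|gshadow)\.\d+$
```

`blacklist` in inputs.conf is a regex matched against the full path, so this skips the numbered temp files while still monitoring the real `/etc/passwd`.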