Our search head cluster environment is crashing at the start of every hour. Nodes go down without any notable error in splunkd.log. The crash logs provide the following information:
[build aa7d4b1ccb80] 2016-03-02 11:01:04
Received fatal signal 6 (Aborted).
Cause:
Signal sent by PID 2461 running under UID 0.
Crashing thread: tailreader0
Registers:
RIP: [0x00007FDDBC1625F7] gsignal + 55 (/lib64/libc.so.6)
RDI: [0x000000000000099D]
RSI: [0x0000000000000CC6]
RBP: [0x00007FDDBC2ABBE8]
RSP: [0x00007FDD9ABFE8C8]
RAX: [0x0000000000000000]
RBX: [0x00007FDDBD6B1000]
RCX: [0x00007FDDBC1625F7]
RDX: [0x0000000000000006]
R8: [0xFEFEFEFEFEFEFEFF]
R9: [0x00007FDDBD6F9F60]
R10: [0x0000000000000008]
R11: [0x0000000000000202]
R12: [0x0000000001813019]
R13: [0x000000000187EFE0]
R14: [0x00007FDD9AC7EC28]
R15: [0x00007FDD9ABFEE68]
EFL: [0x0000000000000202]
TRAPNO: [0x0000000000000000]
ERR: [0x0000000000000000]
CSGSFS: [0x0000000000000033]
OLDMASK: [0x0000000000000000]
OS: Linux
Arch: x86-64
Backtrace:
[0x00007FDDBC1625F7] gsignal + 55 (/lib64/libc.so.6)
[0x00007FDDBC163CE8] abort + 328 (/lib64/libc.so.6)
[0x00007FDDBC15B566] ? (/lib64/libc.so.6)
[0x00007FDDBC15B612] ? (/lib64/libc.so.6)
[0x0000000000A503EA] ? (splunkd)
[0x0000000000A4E6C3] _ZNK11TailWatcher12setupConfigsER15WatchedTailFile + 1507 (splunkd)
[0x0000000000A4E7D2] _ZNK11TailWatcher19initializeFileStateER15WatchedTailFileRK8Pathname + 66 (splunkd)
[0x0000000000A679F5] _ZN10TailReader10handleFileEP15WatchedTailFileP11TailWatcher + 69 (splunkd)
[0x0000000000A6A2DA] _ZN12ReaderThread4mainEv + 378 (splunkd)
[0x000000000109F0EE] _ZN6Thread8callMainEPv + 62 (splunkd)
[0x00007FDDBC4F6DC5] ? (/lib64/libpthread.so.0)
[0x00007FDDBC223BDD] clone + 109 (/lib64/libc.so.6)
Linux / ip-10-37-20-183 / 4.1.10-17.31.amzn1.x86_64 / #1 SMP Sat Oct 24 01:31:37 UTC 2015 / x86_64
Last few lines of stderr (may contain info on assertion failure, but also could be old):
File 'etc/apps/splunk_management_console/default/transforms.conf' changed or missing.
File 'etc/apps/user-prefs/default/app.conf' changed or missing.
File 'etc/apps/user-prefs/default/user-prefs.conf' changed or missing.
File 'etc/system/default/authorize.conf' changed or missing.
File 'etc/system/default/limits.conf' changed or missing.
2016-03-01 19:09:30.757 +0000 splunkd started (build aa7d4b1ccb80)
splunkd: /home/build/build-src/ember/src/pipeline/input/Tailing.h:178: bool StatWrap::isDir() const: Assertion `_valid' failed.
2016-03-02 06:04:02.168 +0000 splunkd started (build aa7d4b1ccb80)
splunkd: /home/build/build-src/ember/src/pipeline/input/Tailing.h:178: bool StatWrap::isDir() const: Assertion `_valid' failed.
glibc version: 2.17
glibc release: stable
Last errno: 2
Threads running: 85
Runtime: 17821.939688s
argv: [splunkd -p 8089 start]
Thread: "tailreader0", did_join=0, ready_to_run=Y, main_thread=N
First 8 bytes of Thread token @0x7fdda64b6610:
00000000 00 f7 bf 9a dd 7f 00 00 |........|
00000008
ReaderThread: mode=0, queueSize=3, shutdown=N, reconfigure=N, mode=0
Reading File-WatchedTailFile-WatchedFileState: path="/etc/passwd.20512", flags=0x24233
First 144 bytes of PathnameStat @0x7fdd8f18c408:
00000000 01 ca 00 00 00 00 00 00 95 08 04 00 00 00 00 00 |................|
00000010 01 00 00 00 00 00 00 00 80 81 00 00 00 00 00 00 |................|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 06 00 00 00 00 00 00 00 00 10 00 00 00 00 00 00 |................|
00000040 08 00 00 00 00 00 00 00 ef c7 d6 56 00 00 00 00 |...........V....|
00000050 9f af 83 05 00 00 00 00 ef c7 d6 56 00 00 00 00 |...........V....|
00000060 9f af 83 05 00 00 00 00 ef c7 d6 56 00 00 00 00 |...........V....|
00000070 9f af 83 05 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000090
FilesystemChangeWatcher: _timeoutActive=N, _throttled=N, _waitingForNotifyCount=4
EMPTY Q: waitingForTimeout=N, noAction=N, stat=Y, immediateStat=Y, readdir=Y, notify=Y
Timeout: _when = 8017369883810230355.7291118079607532909, _initialMsec = 8242823450149347693
file-in: _initialized=Y, _lastCharWasNewline=N, _lastReadHadNulls=N, _wasCrcConflict=N, _warned=N
_nullsWarned=N, _wasTooNew=N, _exists=N, _noDebug=N
_hadExplicitSource=N, _crossedInitCrcLenBoundary=N, _classifiedAtLeastOnce=N, _fileReplaced=N, _readPathAfterRealEOF=N
_onlyNotifiedOnce=Y, _isArchive=N, _isCached=111213, _unowned=N, _deleteOnEOF=N
_overrideDeleteOnEOF=N, _doNotDeleteChildren=N, _readFromEnd=N, _readIrregardless=N
_fileCheckMethod=0, _crcSalt=
_bytesRead=0, _storingBytesRead=0, _initCrc=0x0, _seekCrc=0x0
_filenameCrc=0xb6f41d1c1b5da0d8, _fallbackCrc=0x0, _lastEOFTime=
_eofSeconds=3, _ignoreThresh=
_pendingMetadata=
_prevFd=-1, _pdModels=[0 PDs]
_rescheduleDelay=1000, _rescheduleTarget=
_st=[dev=51713, ino=264341, mode=100600, size=6, mtime=1456916463, owner=0, group=0]
_toStringPrefix=state=0x0x7fdd8f18c380, _backoff=0
_stdataInputHeaderProcessing=[]
_detectTrailingNulls=N, _detectReadingFromOffSet=N, _readAndSkipHeader=N, _uniqueId=0
_rawPath=
x86 CPUID registers:
0: 0000000D 756E6547 6C65746E 49656E69
1: 000306E4 07080800 FFBA2203 178BFBFF
2: 76036301 00F0B2FF 00000000 00CA0000
3: 00000000 00000000 00000000 00000000
4: 00000000 00000000 00000000 00000000
5: 00000000 00000000 00000000 00000000
6: 00000000 00000000 00000000 00000000
7: 00000000 00000000 00000000 00000000
8: 00000000 00000000 00000000 00000000
9: 00000000 00000000 00000000 00000000
A: 00000000 00000000 00000000 00000000
B: 00000000 00000000 00000000 00000007
C: 00000000 00000000 00000000 00000000
D: 00000000 00000000 00000000 00000000
80000000: 80000008 00000000 00000000 00000000
80000001: 00000000 00000000 00000001 28100800
80000002: 20202020 6E492020 286C6574 58202952
80000003: 286E6F65 43202952 45205550 36322D35
80000004: 76203038 20402032 30382E32 007A4847
80000005: 00000000 00000000 00000000 00000000
80000006: 00000000 00000000 01006040 00000000
80000007: 00000000 00000000 00000000 00000000
80000008: 0000302E 00000000 00000000 00000000
terminating...
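A few clues in the dump fit together: the assertion `StatWrap::isDir() const: Assertion `_valid' failed`, `Last errno: 2` (ENOENT, "No such file or directory"), and the file being read, `path="/etc/passwd.20512"`. A name like `passwd.<number>` looks like the transient lock/temp file that shadow-utils creates next to `/etc/passwd` while accounts are modified, and such files vanish almost immediately. Our guess (an assumption, since Splunk's source is not public) is that the tailing processor stats the file after it has already been removed and then uses the invalid stat result, tripping the assertion. A minimal Python sketch of that pattern, with all names hypothetical:

```python
import os
import stat

class StatWrapSketch:
    """Toy stand-in for the StatWrap in the crash log (hypothetical, for illustration)."""

    def __init__(self, path):
        try:
            self._st = os.stat(path)   # can fail with errno 2 (ENOENT) if the file vanished
        except FileNotFoundError:
            self._st = None            # the '_valid' flag would be false here

    def is_dir_unchecked(self):
        # Mirrors the asserting behavior seen in the crash: using the result
        # of a failed stat() aborts the process instead of degrading gracefully.
        assert self._st is not None, "_valid' failed"
        return stat.S_ISDIR(self._st.st_mode)

    def is_dir_safe(self):
        # A race-tolerant variant: a vanished file is simply "not a directory".
        return self._st is not None and stat.S_ISDIR(self._st.st_mode)
```

The point of the sketch: a monitored path that disappears between directory scan and `stat()` is a normal race on a live system, so the check needs to tolerate it rather than assert.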
One pattern I found: the crash logs are always generated at a fixed time each hour, hh:01:03 (visible in the crash-log filenames).
I have gone through the list of alerts triggered at the start of the hour, but couldn't find the cause.
We are running Splunk 6.3.0 on a machine with 8 CPUs and 16 GB RAM.
dmesg shows the following output:
[5641614.805910] splunkd[5823]: segfault at 10 ip 00000000014d7a91 sp 00007f5987bfc860 error 4 in splunkd[400000+1ade000]
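That dmesg line can be decoded: `segfault at 10` means the faulting access was at address 0x10, i.e. a dereference through a near-NULL pointer, and `error 4` is the x86 page-fault error code. The bit meanings below come from the x86 architecture, not from Splunk; a small decoder:

```python
def decode_pf_error(code):
    """Decode the low bits of an x86 page-fault error code, as reported in
    kernel 'segfault at ADDR ... error N' messages."""
    return {
        "page_present": bool(code & 1),       # False: fault on a non-present page
        "write": bool(code & 2),              # False: the access was a read
        "user_mode": bool(code & 4),          # True: fault came from user space
        "instruction_fetch": bool(code & 16), # True: fault on an instruction fetch
    }

# error 4 => a user-mode read of an unmapped page: a classic NULL(+0x10) dereference
print(decode_pf_error(4))
```

So the segfault is consistent with the crash-log picture: user-space splunkd reading through an invalid pointer.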
From the docs, we found that this issue was supposed to be fixed in 6.2.3, but we hit it even on 6.3.0.
We tried upgrading from 6.3.0 to 6.3.3, and that worked: we have not seen the crash for the past 20 days.
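For anyone who cannot upgrade immediately: if our reading of the crash is right (that tailing races with transient `/etc/passwd.<pid>` lock files), a possible mitigation is to blacklist those files from any monitor input that covers `/etc`. This is our own workaround idea, not an official fix; the stanza below is an example and the path must match your actual input:

```ini
# Hypothetical example: only relevant if you have a monitor input over /etc.
# Excludes transient shadow-utils lock/temp files such as /etc/passwd.20512,
# which appear and vanish while user accounts are being modified.
[monitor:///etc]
blacklist = (passwd|shadow|group|gshadow)\.\d+$
```

`blacklist` in inputs.conf is a regex matched against the full path, so this skips the numbered temp files while still monitoring the real `/etc/passwd`.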