Getting Data In

Why is our Splunk 6.2.4 forwarder on Linux crashing frequently in our production environment?

armonsal
Explorer

Dear,

I have this crash with the Splunk forwarder 6.2.4 for Linux x64. The forwarder crashes with frequency in a production environment. I don't have a problem with hardware resources.

crash-2015-07-24-07:08:46.log | more
[build 271043] 2015-07-24 07:08:46
Received fatal signal 6 (Aborted).
 Cause:
   Signal sent by PID 25804 running under UID 533.
 Crashing thread: MainTailingThread
 Registers:
    RIP:  [0x0000003FEC6329A5] gsignal + 53 (/lib64/libc.so.6)
    RDI:  [0x00000000000064CC]
    RSI:  [0x00000000000064F0]
    RBP:  [0x0000003FEC740C2D]
    RSP:  [0x00007F7966FFE288]
    RAX:  [0x0000000000000000]
    RBX:  [0x00007FFF6E2DD739]
    RCX:  [0xFFFFFFFFFFFFFFFF]
    RDX:  [0x0000000000000006]
    R8:  [0x00007F7966FFF710]
    R9:  [0x202764696C61765F]
    R10:  [0x0000000000000008]
    R11:  [0x0000000000000206]
    R12:  [0x00000000015B46B5]
    R13:  [0x0000000001614660]
    R14:  [0x0000003FEC740C2D]
    R15:  [0x0000000000000078]
    EFL:  [0x0000000000000206]
    TRAPNO:  [0x0000000000000000]
    ERR:  [0x0000000000000000]
    CSGSFS:  [0x0000000000000033]
    OLDMASK:  [0x0000000000000000]

 OS: Linux
 Arch: x86-64

 Backtrace:
  [0x0000003FEC6329A5] gsignal + 53 (/lib64/libc.so.6)
  [0x0000003FEC634185] abort + 373 (/lib64/libc.so.6)
  [0x0000003FEC62B935] __assert_fail + 245 (/lib64/libc.so.6)
  [0x000000000099145A] ? (splunkd)
  [0x000000000098D582] _ZNK11TailWatcher12setupConfigsER15WatchedTailFile + 1474 (splunkd)
  [0x000000000098D692] _ZNK11TailWatcher19initializeFileStateER15WatchedTailFileRK8Pathname + 66 (splunkd)
  [0x00000000009904B2] _ZN11TailWatcher11fileChangedEP16WatchedFileStateRK7Timeval + 242 (splunkd)
  [0x0000000000EC2602] _ZN30FilesystemChangeInternalWorker15callFileChangedER7TimevalP16WatchedFileState + 114 (splunkd)
  [0x0000000000EC3F90] _ZN30FilesystemChangeInternalWorker12when_expiredERy + 464 (splunkd)
  [0x0000000000F53B2D] _ZN11TimeoutHeap18runExpiredTimeoutsER7Timeval + 301 (splunkd)
  [0x0000000000EBD818] _ZN9EventLoop3runEv + 744 (splunkd)
  [0x000000000098E9ED] _ZN11TailWatcher3runEv + 141 (splunkd)
  [0x000000000099428A] _ZN13TailingThread4mainEv + 154 (splunkd)
  [0x0000000000F5165E] _ZN6Thread8callMainEPv + 62 (splunkd)
  [0x0000003FECA077E1] ? (/lib64/libpthread.so.0)
  [0x0000003FEC6E153D] clone + 109 (/lib64/libc.so.6)
 Linux / Dymas / 2.6.32-71.el6.x86_64 / #1 SMP Wed Sep 1 01:33:01 EDT 2010 / x86_64
 Last few lines of stderr (may contain info on assertion failure, but also could be old):
    2015-07-17 12:30:37.739 -0300 splunkd started (build 271043)
    2015-07-17 12:40:16.775 -0300 Interrupt signal received
    2015-07-17 14:31:12.322 -0300 splunkd started (build 271043)
    splunkd: /home/build/build-src/6.2.4/src/pipeline/input/Tailing.h:120: bool StatWrap::isDir() const: Assertion `_valid' failed.
    2015-07-20 22:08:15.270 -0300 splunkd started (build 271043)
    splunkd: /home/build/build-src/6.2.4/src/pipeline/input/Tailing.h:120: bool StatWrap::isDir() const: Assertion `_valid' failed.
    2015-07-22 13:13:13.179 -0300 splunkd started (build 271043)
    splunkd: /home/build/build-src/6.2.4/src/pipeline/input/Tailing.h:120: bool StatWrap::isDir() const: Assertion `_valid' failed.

 /etc/redhat-release: Red Hat Enterprise Linux Server release 6.0 (Santiago)
 glibc version: 2.12
 glibc release: stable
Last errno: 2
Threads running: 28
argv: [splunkd -p 8089 start]
Thread: "MainTailingThread", did_join=0, ready_to_run=Y, main_thread=N
First 8 bytes of Thread token @0x7f796701e150:
00000000  10 f7 ff 66 79 7f 00 00                           |...fy...|
00000008

First 512 bytes of Timeout object @0x7f7966ffea88:
00000000  10 f2 6d 01 00 00 00 00  00 00 00 00 00 00 00 00  |..m.............|
00000010  38 e8 ff 66 79 7f 00 00  00 00 00 00 00 00 00 00  |8..fy...........|
00000020  00 00 00 00 00 00 00 00  ad 0e b2 55 00 00 00 00  |...........U....|
[ope_splunk@Dymas splunk]$ cat crash-2015-07-24-07:08:46.log 
[build 271043] 2015-07-24 07:08:46
Received fatal signal 6 (Aborted).
 Cause:
   Signal sent by PID 25804 running under UID 533.
 Crashing thread: MainTailingThread
 Registers:
    RIP:  [0x0000003FEC6329A5] gsignal + 53 (/lib64/libc.so.6)
    RDI:  [0x00000000000064CC]
    RSI:  [0x00000000000064F0]
    RBP:  [0x0000003FEC740C2D]
    RSP:  [0x00007F7966FFE288]
    RAX:  [0x0000000000000000]
    RBX:  [0x00007FFF6E2DD739]
    RCX:  [0xFFFFFFFFFFFFFFFF]
    RDX:  [0x0000000000000006]
    R8:  [0x00007F7966FFF710]
    R9:  [0x202764696C61765F]
    R10:  [0x0000000000000008]
    R11:  [0x0000000000000206]
    R12:  [0x00000000015B46B5]
    R13:  [0x0000000001614660]
    R14:  [0x0000003FEC740C2D]
    R15:  [0x0000000000000078]
    EFL:  [0x0000000000000206]
    TRAPNO:  [0x0000000000000000]
    ERR:  [0x0000000000000000]
    CSGSFS:  [0x0000000000000033]
    OLDMASK:  [0x0000000000000000]

 OS: Linux
 Arch: x86-64

 Backtrace:
  [0x0000003FEC6329A5] gsignal + 53 (/lib64/libc.so.6)
  [0x0000003FEC634185] abort + 373 (/lib64/libc.so.6)
  [0x0000003FEC62B935] __assert_fail + 245 (/lib64/libc.so.6)
  [0x000000000099145A] ? (splunkd)
  [0x000000000098D582] _ZNK11TailWatcher12setupConfigsER15WatchedTailFile + 1474 (splunkd)
  [0x000000000098D692] _ZNK11TailWatcher19initializeFileStateER15WatchedTailFileRK8Pathname + 66 (splunkd)
  [0x00000000009904B2] _ZN11TailWatcher11fileChangedEP16WatchedFileStateRK7Timeval + 242 (splunkd)
  [0x0000000000EC2602] _ZN30FilesystemChangeInternalWorker15callFileChangedER7TimevalP16WatchedFileState + 114 (splunkd)
  [0x0000000000EC3F90] _ZN30FilesystemChangeInternalWorker12when_expiredERy + 464 (splunkd)
  [0x0000000000F53B2D] _ZN11TimeoutHeap18runExpiredTimeoutsER7Timeval + 301 (splunkd)
  [0x0000000000EBD818] _ZN9EventLoop3runEv + 744 (splunkd)
  [0x000000000098E9ED] _ZN11TailWatcher3runEv + 141 (splunkd)
  [0x000000000099428A] _ZN13TailingThread4mainEv + 154 (splunkd)
  [0x0000000000F5165E] _ZN6Thread8callMainEPv + 62 (splunkd)
  [0x0000003FECA077E1] ? (/lib64/libpthread.so.0)
  [0x0000003FEC6E153D] clone + 109 (/lib64/libc.so.6)
 Linux / Dymas / 2.6.32-71.el6.x86_64 / #1 SMP Wed Sep 1 01:33:01 EDT 2010 / x86_64
 Last few lines of stderr (may contain info on assertion failure, but also could be old):
    2015-07-17 12:30:37.739 -0300 splunkd started (build 271043)
    2015-07-17 12:40:16.775 -0300 Interrupt signal received
    2015-07-17 14:31:12.322 -0300 splunkd started (build 271043)
    splunkd: /home/build/build-src/6.2.4/src/pipeline/input/Tailing.h:120: bool StatWrap::isDir() const: Assertion `_valid' failed.
    2015-07-20 22:08:15.270 -0300 splunkd started (build 271043)
    splunkd: /home/build/build-src/6.2.4/src/pipeline/input/Tailing.h:120: bool StatWrap::isDir() const: Assertion `_valid' failed.
    2015-07-22 13:13:13.179 -0300 splunkd started (build 271043)
    splunkd: /home/build/build-src/6.2.4/src/pipeline/input/Tailing.h:120: bool StatWrap::isDir() const: Assertion `_valid' failed.

 /etc/redhat-release: Red Hat Enterprise Linux Server release 6.0 (Santiago)
 glibc version: 2.12
 glibc release: stable
Last errno: 2
Threads running: 28
argv: [splunkd -p 8089 start]
Thread: "MainTailingThread", did_join=0, ready_to_run=Y, main_thread=N
First 8 bytes of Thread token @0x7f796701e150:
00000000  10 f7 ff 66 79 7f 00 00                           |...fy...|
00000008

First 512 bytes of Timeout object @0x7f7966ffea88:
00000000  10 f2 6d 01 00 00 00 00  00 00 00 00 00 00 00 00  |..m.............|
00000010  38 e8 ff 66 79 7f 00 00  00 00 00 00 00 00 00 00  |8..fy...........|
00000020  00 00 00 00 00 00 00 00  ad 0e b2 55 00 00 00 00  |...........U....|
00000030  55 7f 0d 00 00 00 00 00  00 00 00 00 00 00 00 00  |U...............|
00000040  80 ea ff 66 79 7f 00 00  20 ec ff 66 79 7f 00 00  |...fy... ..fy...|
00000050  01 00 00 00 01 00 00 00  c0 1c 46 64 79 7f 00 00  |..........Fdy...|
00000060  c0 a7 02 66 79 7f 00 00  80 2a 46 64 79 7f 00 00  |...fy....*Fdy...|
00000070  80 2a 46 64 79 7f 00 00  00 eb ff 66 79 7f 00 00  |.*Fdy......fy...|
00000080  00 eb ff 66 79 7f 00 00  10 eb ff 66 79 7f 00 00  |...fy......fy...|
00000090  10 eb ff 66 79 7f 00 00  20 eb ff 66 79 7f 00 00  |...fy... ..fy...|
000000a0  20 eb ff 66 79 7f 00 00  40 29 46 64 79 7f 00 00  | ..fy...@)Fdy...|
000000b0  40 29 46 64 79 7f 00 00  00 00 00 00 00 00 00 00  |@)Fdy...........|
000000c0  00 b0 05 66 79 7f 00 00  14 10 00 00 00 00 00 00  |...fy...........|
000000d0  51 fd df 2c 00 00 00 00  78 49 61 01 00 00 00 00  |Q..,....xIa.....|
000000e0  58 10 02 66 79 7f 00 00  00 00 00 00 00 00 00 00  |X..fy...........|
000000f0  00 00 00 00 00 00 00 00  80 45 2b 66 79 7f 00 00  |.........E+fy...|
00000100  f0 44 2b 66 79 7f 00 00  b0 45 2b 66 79 7f 00 00  |.D+fy....E+fy...|
00000110  09 00 00 00 00 00 00 00  00 22 02 66 79 7f 00 00  |.........".fy...|
00000120  64 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |d...............|
00000130  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000140  b8 eb ff 66 79 7f 00 00  b8 eb ff 66 79 7f 00 00  |...fy......fy...|
00000150  00 00 00 00 00 00 00 00  80 f0 02 66 79 7f 00 00  |...........fy...|
00000160  60 01 82 6d 79 7f 00 00  00 00 00 00 00 00 00 00  |`..my...........|
00000170  d8 2b 38 73 79 7f 00 00  00 22 02 66 79 7f 00 00  |.+8sy....".fy...|
00000180  40 26 13 66 79 7f 00 00  70 26 13 66 79 7f 00 00  |@&.fy...p&.fy...|
00000190  80 26 13 66 79 7f 00 00  2a 00 00 00 10 00 00 00  |.&.fy...*.......|
000001a0  80 1c 46 64 79 7f 00 00  00 00 00 00 00 00 00 00  |..Fdy...........|
000001b0  88 ea ff 66 79 7f 00 00  00 00 00 00 00 00 00 00  |...fy...........|
000001c0  22 00 00 00 79 7f 00 00  c0 40 03 66 79 7f 00 00  |"...y....@.fy...|
000001d0  10 00 00 00 aa aa aa aa  00 00 00 00 00 00 00 00  |................|
000001e0  00 00 00 00 79 7f 00 00  60 e9 ff 66 79 7f 00 00  |....y...`..fy...|
000001f0  10 f7 ff 66 79 7f 00 00  00 00 00 00 00 00 00 00  |...fy...........|
00000200
FilesystemChangeWatcher: _timeoutActive=Y, _throttled=N, _waitingForNotifyCount=1
  EMPTY Q: waitingForTimeout=N, noAction=N, stat=Y, immediateStat=Y, readdir=Y, notify=N

WatchedTailFile-WatchedFileState: path="/opt/splunkforwarder/var/run/nmon/var/csv_repository/Dymas_24_JUL_2015_053319_FILE_444882_20150724070843.nmon.csv", flags=0x24003
First 144 bytes of PathnameStat @0x7f7964462ac8:
00000000  31 35 3a 32 33 3a 35 38  2e 34 39 37 20 2d 30 33  |15:23:58.497 -03|
00000010  30 30 20 49 4e 46 4f 20  20 53 65 72 76 65 72 43  |00 INFO  ServerC|
00000020  6f 6e 66 69 67 20 2d 20  4d 79 20 6e 65 77 6c 79  |onfig - My newly|
00000030  20 67 65 6e 65 72 61 74  65 64 20 47 55 49 44 20  | generated GUID |
00000040  69 73 20 37 30 45 35 45  37 38 42 2d 42 38 46 33  |is 70E5E78B-B8F3|
00000050  2d 34 36 35 38 2d 38 30  43 30 2d 41 41 32 36 34  |-4658-80C0-AA264|
00000060  35 43 37 35 46 43 42 0a  30 37 2d 31 34 2d 32 30  |5C75FCB.07-14-20|
00000070  31 35 20 31 35 3a 32 33  3a 35 38 2e 34 39 38 20  |15 15:23:58.498 |
00000080  2d 30 33 30 30 20 49 4e  46 4f 20 20 53 65 72 76  |-0300 INFO  Serv|
00000090
FilesystemChangeWatcher: _timeoutActive=Y, _throttled=N, _waitingForNotifyCount=1
  EMPTY Q: waitingForTimeout=N, noAction=N, stat=Y, immediateStat=Y, readdir=Y, notify=N
  Timeout: _when = 2321387029209298234.5641075399597436973, _initialMsec = 8534995651516190534
file-in: _initialized=Y, _lastCharWasNewline=N, _lastReadHadNulls=N, _wasCrcConflict=N, _warned=N
         _nullsWarned=N, _wasTooNew=N, _exists=N, _noDebug=N
         _hadExplicitSource=N, _crossedInitCrcLenBoundary=N, _classifiedAtLeastOnce=N, _fileReplaced=N, _readPathAfterRealEOF=N
         _onlyNotifiedOnce=Y, _isArchive=N, _isCached=111213, _unowned=N, _deleteOnEOF=N
         _overrideDeleteOnEOF=N, _doNotDeleteChildren=N, _readFromEnd=N, _readIrregardless=N
         _fileCheckMethod=0, _crcSalt=<null>, _origPath=<null>
         _bytesRead=0, _storingBytesRead=0, _initCrc=0x0, _seekCrc=0x0
         _filenameCrc=0x91b72885e4a713ee, _fallbackCrc=0x0, _lastEOFTime=<zero>, _modTime=<zero>
         _eofSeconds=3, _ignoreThresh=<zero>, _initCrcBytes=256, _initCrcForBatch=0x0
         _pendingMetadata=<null>
         _prevFd=-1, _pdModels=[0 PDs]
         _rescheduleDelay=1000, _rescheduleTarget=<zero>, _name=/opt/splunkforwarder/var/run/nmon/var/csv_repository/Dymas_24_JUL_2015_053319_FILE_444882_20150724070843.nmon.csv, _statusName=
         _st=[dev=64768, ino=1850031, mode=100600, size=1130, mtime=1437732465, owner=533, group=533]
         _toStringPrefix=state=0x0x7f7964462a40, _backoff=0
         _stdataInputHeaderProcessing=[]

         _detectTrailingNulls=N, _detectReadingFromOffSet=N, _readAndSkipHeader=N, _uniqueId=72747
  _rawPath=


x86 CPUID registers:
         0: 0000000B 756E6547 6C65746E 49656E69
         1: 000206C2 10200800 029EE3FF BFEBFBFF
         2: 55035A01 00F0B2FF 00000000 00CA0000
         3: 00000000 00000000 00000000 00000000
         4: 00000000 00000000 00000000 00000000
         5: 00000040 00000040 00000003 00001120
         6: 00000007 00000002 00000009 00000000
         7: 00000000 00000000 00000000 00000000
         8: 00000000 00000000 00000000 00000000
         9: 00000000 00000000 00000000 00000000
         A: 07300403 00000004 00000000 00000603
         B: 00000000 00000000 000000AD 00000010
  80000000: 80000008 00000000 00000000 00000000
  80000001: 00000000 00000000 00000001 2C100800
  80000002: 65746E49 2952286C 6F655820 2952286E
  80000003: 55504320 20202020 20202020 58202020
  80000004: 30353635 20402020 37362E32 007A4847
  80000005: 00000000 00000000 00000000 00000000
  80000006: 00000000 00000000 01006040 00000000
  80000007: 00000000 00000000 00000000 00000100
  80000008: 00003028 00000000 00000000 00000000
terminating...
Tags (2)
1 Solution

yannK
Splunk Employee
Splunk Employee

The crash pattern is similar to a known bug in splunk 6.2.4, the fix will be in the next version.

View solution in original post

hrawat_splunk
Splunk Employee
Splunk Employee

6.2.4.1 patch is already available to address this issue. Ask splunk support to provide bits.

guilmxm
SplunkTrust
SplunkTrust

6.2.5 is globally available for download, with the patch in change log:

2015-8-11   SPL-104017  Splunkd crash due to assertion failure in Tailing.

guilmxm
SplunkTrust
SplunkTrust

Hi armonsal,

Splunk 6.2.4 seems to have introduced a bug that causes splunkd to crash when a monitor watches for files that may be deleted (maybe too fast ?)

I see in your output that the crash is related with the Nmon Performance Monitor:

 WatchedTailFile-WatchedFileState: path="/opt/splunkforwarder/var/run/nmon/var/csv_repository/Dymas_24_JUL_2015_053319_FILE_444882_20150724070843.nmon.csv", flags=0x24003

The crash is not directly caused by Nmon App, until recently the processing steps used to create csv files in the same directory than splunk watches for, in some cases empty files could be created and deleted by nmon2csv converters, which causes splunkd in 6.2.4 to crash. (which is totally unexpected and wasn't the case before)

I have released on 5 august 2014 an hotfix release with a workaround to manage this, now files are moved from a working directory to final directory splunk watches for, which solves the issue from splunkd.

Please update to Nmon Perf Monitor 1.6.04 and your problem will be solved.

Guilhem

0 Karma

yannK
Splunk Employee
Splunk Employee

The crash pattern is similar to a known bug in splunk 6.2.4, the fix will be in the next version.

oHable
Explorer

Hi,

big thanks for that information...

sincerely oliver

0 Karma
Get Updates on the Splunk Community!

.conf24 | Registration Open!

Hello, hello! I come bearing good news: Registration for .conf24 is now open!   conf is Splunk’s rad annual ...

Splunk is officially part of Cisco

Revolutionizing how our customers build resilience across their entire digital footprint.   Splunk ...

Splunk APM & RUM | Planned Maintenance March 26 - March 28, 2024

There will be planned maintenance for Splunk APM and RUM between March 26, 2024 and March 28, 2024 as ...