Agents are installed on 3 servers to capture WebSphere Application Server (WAS) logs. Nothing in the environment has changed except that the WAS JVM was patched and then restarted. Since then, the agent crashes every other day, and I can't work out the cause from the logs. I don't think the JVM patching is relevant, but I'm mentioning it anyway. Here is a snippet from a recently crashed agent:
Splunk version: 4.2.1 (build 98164)
WAS version: 7.0.0.31
OS: Linux 2.6.18-348.3.1.el5 #1 SMP Tue Mar 5 13:19:32 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
Crash log:
Received fatal signal 6 (Aborted).
Cause:
Signal sent by PID 29658 running under UID 2441.
Crashing thread: MainTailingThread
Registers:
RIP: [0x0000003809030295] gsignal + 53 (/lib64/libc.so.6)
RDI: [0x00000000000073DA]
RSI: [0x00000000000073F4]
RBP: [0x00002B4911006940]
RSP: [0x00002B4911005628]
RAX: [0x0000000000000000]
RBX: [0x00002B49110056D0]
RCX: [0xFFFFFFFFFFFFFFFF]
RDX: [0x0000000000000006]
R8: [0x0000000000000080]
R9: [0x0101010101010101]
R10: [0x0000000000000008]
R11: [0x0000000000000202]
R12: [0x00007FFF6C43AA7D]
R13: [0x0000000000F984E8]
R14: [0x000000000000021A]
R15: [0x0000000000F97D18]
EFL: [0x0000000000000202]
TRAPNO: [0x0000000000000000]
ERR: [0x0000000000000000]
CSGSFS: [0x0000000000000033]
OLDMASK: [0x0000000000000000]
OS: Linux
Arch: x86-64
Backtrace:
[0x0000003809031D40] abort + 272 (/lib64/libc.so.6)
[0x0000003809029716] __assert_fail + 246 (/lib64/libc.so.6)
[0x0000000000671673] _ZSt16__introsort_loopIN9__gnu_cxx17__normal_iteratorIPP15WatchedTailFileSt6vectorIS3_SaIS3_EEEEl22WTFTimestampComparatorEvT_SA_T0_T1_ + 515 (splunkd)
[0x0000000000665C06] _ZN10TailReader11trimOpenFdsEv + 230 (splunkd)
[0x0000000000665FDA] _ZN10TailReader8readDataER15WatchedTailFileP11TailWatcher + 650 (splunkd)
[0x0000000000666448] _ZN10TailReader8readFileER15WatchedTailFileP11TailWatcher + 184 (splunkd)
[0x0000000000666592] _ZN11TailWatcher8readFileER15WatchedTailFile + 82 (splunkd)
[0x0000000000668419] _ZN11TailWatcher11fileChangedEP16WatchedFileStateRK7Timeval + 1161 (splunkd)
[0x0000000000B6DAFC] _ZN30FilesystemChangeInternalWorker12when_expiredERy + 428 (splunkd)
[0x0000000000BB8E03] _ZN11TimeoutHeap18runExpiredTimeoutsER7Timeval + 227 (splunkd)
[0x0000000000B673D8] _ZN9EventLoop3runEv + 216 (splunkd)
[0x000000000066B7AF] _ZN11TailWatcher3runEv + 143 (splunkd)
[0x000000000066D7C1] _ZN13TailingThread4mainEv + 289 (splunkd)
[0x0000000000BB6892] _ZN6Thread8callMainEPv + 66 (splunkd)
[0x0000003809C0683D] ? (/lib64/libpthread.so.0)
[0x00000038090D500D] clone + 109 (/lib64/libc.so.6)
Linux / alt-met-asol-was03 / 2.6.18-348.3.1.el5 / #1 SMP Tue Mar 5 13:19:32 EST 2013 / x86_64
Last few lines of stderr (may contain info on assertion failure, but also could be old):
2013-02-28 13:30:03.674 +0100 splunkd started (build 98164)
2013-04-15 13:46:42.551 +0200 Interrupt signal received
2013-04-15 13:50:48.847 +0200 splunkd started (build 98164)
2013-08-29 04:35:38.328 +0200 splunkd started (build 98164)
2014-02-24 11:45:17.483 +0100 splunkd started (build 98164)
2014-02-26 15:10:11.065 +0100 splunkd started (build 98164)
2014-03-03 11:45:29.363 +0100 Interrupt signal received
2014-03-03 11:45:38.980 +0100 splunkd started (build 98164)
2014-03-05 10:07:33.392 +0100 splunkd started (build 98164)
splunkd: /opt/splunk/p4/splunk/branches/hammer/src/pipeline/input/Tailing.cpp:538: bool WTFTimestampComparator::operator()(const WatchedTailFile*, const WatchedTailFile*): Assertion `!x->getLastEOFTime().isZero() && !y->getLastEOFTime().isZero()' failed.
/etc/redhat-release: Red Hat Enterprise Linux Server release 5.9 (Tikanga)
glibc version: 2.5
glibc release: stable
Threads running: 25
argv: [splunkd -p 8090 restart]
terminating...