Enabling LDAP - splunkd crash on startup.
Any ideas?
[build 82143]
Received fatal signal 6 (Aborted).
Cause:
Signal sent by PID 29447 running under UID 0.
Crashing thread: Main Thread
Registers:
RIP: [0x00007F38F5A9C645] gsignal + 53 (/lib64/libc.so.6)
RDI: [0x0000000000007307]
RSI: [0x0000000000007310]
RBP: [0x00007F38F5465F80]
RSP: [0x00007F38F5465AF8]
RAX: [0x0000000000000000]
RBX: [0x00007F38F5465C30]
RCX: [0xFFFFFFFFFFFFFFFF]
RDX: [0x0000000000000006]
R8: [0x00007F38F5B837C0]
R9: [0x2064696C61766E69]
R10: [0x0000000000000008]
R11: [0x0000000000000202]
R12: [0x0000000000F4BBA0]
R13: [0x0000000000000000]
R14: [0x0000000000000000]
R15: [0x0000000000001000]
EFL: [0x0000000000000202]
TRAPNO: [0x0000000000000000]
ERR: [0x0000000000000000]
CSGSFS: [0x0000000000000033]
OLDMASK: [0x0000000000000000]
OS: Linux
Arch: x86-64
Backtrace:
[0x00007F38F5A9DC33] abort + 387 (/lib64/libc.so.6)
[0x0000000000AC36EF] ? (splunkd)
[0x0000000000AC38A6] _ZN22TCMalloc_CrashReporter12PrintfAndDieEPKcz + 150 (splunkd)
[0x0000000000ABC08B] _ZN123_GLOBAL__N__ZN61FLAG__namespace_do_not_use_directly_use_DECLARE_int64_instead43FLAGS_tcmalloc_large_alloc_report_thresholdE11InvalidFreeEPv + 43 (splunkd)
[0x0000000000DD7D35] tc_free + 453 (splunkd)
[0x00007F38F5B4A10D] __res_iclose + 189 (/lib64/libc.so.6)
[0x00007F38F5B75234] ? (/lib64/libc.so.6)
[0x00007F38F5B751C2] __libc_thread_freeres + 34 (/lib64/libc.so.6)
[0x00007F38F7052083] ? (/lib64/libpthread.so.0)
[0x00007F38F5B3D10D] clone + 109 (/lib64/libc.so.6)
Linux / myserver / 2.6.27.45-0.1-default / #1 SMP 2010-02-22 16:49:47 +0100 / x86_64
Last few lines of stderr (may contain info on assertion failure, but also could be old):
src/tcmalloc.cc:353] Attempt to free invalid pointer: 0x1b00010
/etc/SuSE-release: SUSE Linux Enterprise Server 11 (x86_64)
glibc version: 2.9
glibc release: stable
Threads running: 14
terminating...
Hi. I finally have a good answer for your question.
Over the last several months we saw a slow trickle of reports of this crash, but we never had enough information to isolate it. What made it more frustrating is that it seemed to affect just a few customers, and even for them it was hard to reproduce: sometimes Splunk would crash several times in a row, and then the problem would disappear for no apparent reason.
Finally we had enough reports to piece together the common thread: all of them came from some flavor of 64-bit SuSE 11. After a LOT of investigation we found that it's due to a known bug in SuSE's glibc which Novell is planning to fix for OpenSuSE 11.4, and presumably in a future SLES release as well.
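To make the crash dump above a little less mysterious: the backtrace shows glibc's thread-teardown path (__libc_thread_freeres calling __res_iclose) handing tc_free a pointer that tcmalloc doesn't recognize, so tcmalloc's safety check aborts the process with the "Attempt to free invalid pointer" line you see in stderr. The snippet below is only a toy Python model of that kind of check, not Splunk or tcmalloc code; every name and address in it is made up.

    # Toy illustration only -- this is not Splunk or tcmalloc source. It mimics
    # the kind of "invalid free" check behind the src/tcmalloc.cc:353 abort
    # above: the allocator only accepts frees of pointers it handed out itself,
    # and dies on anything else. All names and addresses here are made up.
    class CheckedAllocator:
        def __init__(self):
            self._live = set()            # addresses this allocator owns
            self._next_addr = 0x1000000   # fake address counter for the demo

        def malloc(self, size):
            addr = self._next_addr
            self._next_addr += size
            self._live.add(addr)
            return addr

        def free(self, addr):
            if addr not in self._live:
                # tcmalloc's real check prints a message and aborts here
                raise SystemExit("Attempt to free invalid pointer: %#x" % addr)
            self._live.remove(addr)

    alloc = CheckedAllocator()
    p = alloc.malloc(64)
    alloc.free(p)           # fine: this pointer came from the same allocator
    alloc.free(0x1b00010)   # dies, like glibc's teardown freeing foreign memory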
The good news is that we have identified a workaround in Splunk that avoids this bug, and it will be included in all future versions of Splunk (i.e. anything newer than 4.1.7, which is current as of this writing).
If this crash is happening often enough to cause you serious problems (and you can't wait for the next Splunk release), you may want to request an early-access testing build from Splunk support. Please reference bug "SPL-37331" so they know which issue you're referring to. Again, this is ONLY for 64-bit SuSE installs; no other OSes are affected by this issue.
Jason -- of the reports we've seen, at least several popped up when the customer enabled LDAP auth. Other crash reports didn't involve LDAP at all. We've also run LDAP on SuSE 11 with Splunk 4.1.6 in-house without problems.
So you're right -- the bug in SuSE's libc isn't related to LDAP itself. However, using LDAP does seem to change the timing enough to provoke the crash in some environments.
It does not have anything to do with AD authentication, as the boxes I'm working with use Splunk standard auth.
This bug can evidently also manifest as a crash on restart, so you may not notice it at first, but crash logs will accumulate in $SPLUNK_HOME/var/log/splunk/.
The telltales are __res_iclose and __libc_thread_freeres in the backtrace.
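If you want to check whether the crashes you're collecting match this signature, a rough sketch like the one below scans those logs for both telltale frames. It is not an official Splunk tool; the crash-* filename pattern and the /opt/splunk fallback are my assumptions, so adjust them for your install.

    #!/usr/bin/env python3
    # Rough sketch: scan Splunk crash logs for the two telltale glibc frames
    # named above. The "crash-*" filename pattern and the /opt/splunk fallback
    # are assumptions; adjust them for your installation.
    import glob
    import os

    splunk_home = os.environ.get("SPLUNK_HOME", "/opt/splunk")
    pattern = os.path.join(splunk_home, "var", "log", "splunk", "crash-*")
    telltales = ("__res_iclose", "__libc_thread_freeres")

    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as f:
            text = f.read()
        if all(sym in text for sym in telltales):
            print("matches the SuSE glibc teardown signature:", path)
        else:
            print("different crash signature:", path)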
I would highly recommend that you pursue a support case for ANY splunkd crash. You might get a suitable answer here from someone, but more likely your crash info will need to be evaluated by someone with access to the source code, who can get more context around the backtrace above.
Thanks. Will follow up with Splunk.