Linux - splunkd v4.1.4 crash with LDAP authentication enabled

Explorer

Enabling LDAP causes splunkd to crash on startup.

  • Running Splunk standalone (i.e. not clustered as per previous post)
  • Splunk v4.1.4 (build 82143).
  • LDAP against Windows Server 2003 Active Directory. Server hit has a global catalog.
  • ldapsearch tests for both groups & users are successful as per splunk docs.
  • Have set groupBaseFilter to only include (cn=APP-Splunk*) groups (3 exist)
  • Have set userBaseFilter to only include my account (cn=myname)
  • splunkd_stderr.log says: src/tcmalloc.cc:353] Attempt to free invalid pointer: 0x1b00010
  • Last line in splunkd.log says: INFO loader - Instantiated plugin: thruputprocessor
  • Running on physical box with 8 cores & 16GB RAM. SLES 11 amd64.
  • Reverting back to Splunk (internal) authentication allows Splunk to start clean.
  • Crash log output below.

Any ideas?

[build 82143]
Received fatal signal 6 (Aborted).
 Cause:
   Signal sent by PID 29447 running under UID 0.
 Crashing thread: Main Thread
 Registers:
    RIP:  [0x00007F38F5A9C645] gsignal + 53 (/lib64/libc.so.6)
    RDI:  [0x0000000000007307]
    RSI:  [0x0000000000007310]
    RBP:  [0x00007F38F5465F80]
    RSP:  [0x00007F38F5465AF8]
    RAX:  [0x0000000000000000]
    RBX:  [0x00007F38F5465C30]
    RCX:  [0xFFFFFFFFFFFFFFFF]
    RDX:  [0x0000000000000006]
    R8:  [0x00007F38F5B837C0]
    R9:  [0x2064696C61766E69]
    R10:  [0x0000000000000008]
    R11:  [0x0000000000000202]
    R12:  [0x0000000000F4BBA0]
    R13:  [0x0000000000000000]
    R14:  [0x0000000000000000]
    R15:  [0x0000000000001000]
    EFL:  [0x0000000000000202]
    TRAPNO:  [0x0000000000000000]
    ERR:  [0x0000000000000000]
    CSGSFS:  [0x0000000000000033]
    OLDMASK:  [0x0000000000000000]

 OS: Linux
 Arch: x86-64

 Backtrace:
  [0x00007F38F5A9DC33] abort + 387 (/lib64/libc.so.6)
  [0x0000000000AC36EF] ? (splunkd)
  [0x0000000000AC38A6] _ZN22TCMalloc_CrashReporter12PrintfAndDieEPKcz + 150 (splunkd)
  [0x0000000000ABC08B] _ZN123_GLOBAL__N__ZN61FLAG__namespace_do_not_use_directly_use_DECLARE_int64_instead43FLAGS_tcmalloc_large_alloc_report_thresholdE11InvalidFreeEPv + 43 (splunkd)
  [0x0000000000DD7D35] tc_free + 453 (splunkd)
  [0x00007F38F5B4A10D] __res_iclose + 189 (/lib64/libc.so.6)
  [0x00007F38F5B75234] ? (/lib64/libc.so.6)
  [0x00007F38F5B751C2] __libc_thread_freeres + 34 (/lib64/libc.so.6)
  [0x00007F38F7052083] ? (/lib64/libpthread.so.0)
  [0x00007F38F5B3D10D] clone + 109 (/lib64/libc.so.6)
 Linux / myserver / 2.6.27.45-0.1-default / #1 SMP 2010-02-22 16:49:47 +0100 / x86_64
 Last few lines of stderr (may contain info on assertion failure, but also could be old):
    src/tcmalloc.cc:353] Attempt to free invalid pointer: 0x1b00010

 /etc/SuSE-release: SUSE Linux Enterprise Server 11 (x86_64)
 glibc version: 2.9
 glibc release: stable
Threads running: 14
terminating...

Explorer

Hi. I finally have a good answer for your question.

Over the last several months we saw a slow trickle of reports of this crash, but we never had enough information to isolate it. What made it more frustrating is that it seemed to happen to just a few customers, and even for them it seemed to be hard to reproduce. Sometimes they would have splunk crash several times in a row, then the problem would suddenly disappear for no apparent reason.

Finally we had enough reports to piece together the common thread: all of the reports come from systems running 64-bit SuSE 11 of some sort. After a LOT of investigation we found out that it's due to a known bug in SuSE which Novell is planning to fix for OpenSuSE 11.4. They'll presumably fix it in a future SLES version as well.

The good news is that we have identified a workaround in splunk that lets us avoid this bug, and we will include it in all future versions of splunk (i.e. newer than "4.1.7", which is current as of this writing).

If this crash is happening often enough to cause you serious problems (and you can't wait for the next splunk release) you may want to get an early-access testing build from splunk support. Please reference bug "SPL-37331" so they know what issue you're referring to. Again, this is ONLY for 64-bit SuSE installs: no other OSes are affected by this issue.

Explorer

Jason -- of the reports that we've seen, several seem to have popped up when enabling LDAP auth. Other crash reports didn't involve LDAP at all. We've also successfully run LDAP on SuSE 11 with splunk 4.1.6 in-house, without problems.

So you're right -- the bug in SuSE's libc isn't related to LDAP. However, it does seem that using LDAP changes the timing of things to help provoke the crash for some environments.

Motivator

It does not have anything to do with AD authentication, as boxes I'm working with use Splunk standard auth.

Motivator

This bug evidently can also manifest itself as a crash on restart, so you may not notice it at first, but crash logs will accumulate in $SPLUNK_HOME/var/log/splunk/

Splunk Employee

The telltales are __res_iclose and __libc_thread_freeres in the backtrace.
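
If you have a pile of crash logs and want to check which ones carry this signature, something like the following might help. It is only a rough sketch: the function names are made up for illustration, the regular expression is based on the backtrace format shown in the crash log earlier in this thread, and the `crash-*.log` file name pattern is an assumption, so adjust it to whatever your crash files are actually called.

```python
import re
from pathlib import Path

# Symbols named in this thread as the signature of the SuSE libc crash.
TELLTALE_SYMBOLS = ("__res_iclose", "__libc_thread_freeres")

# Matches backtrace lines such as:
#   [0x00007F38F5B4A10D] __res_iclose + 189 (/lib64/libc.so.6)
# Register lines ("RIP:  [0x...] gsignal + 53") do not start with "[",
# and unresolved frames ("[0x...] ? (splunkd)") have no "+ offset",
# so neither of those is picked up.
BACKTRACE_LINE = re.compile(
    r"^\s*\[0x[0-9A-Fa-f]+\]\s+(\S+)\s+\+\s+\d+", re.MULTILINE
)


def backtrace_symbols(crash_text):
    """Return the set of symbol names found on backtrace-style lines."""
    return set(BACKTRACE_LINE.findall(crash_text))


def matches_suse_libc_crash(crash_text):
    """True if every telltale symbol appears in the backtrace."""
    symbols = backtrace_symbols(crash_text)
    return all(name in symbols for name in TELLTALE_SYMBOLS)


def scan_crash_logs(log_dir, pattern="crash-*.log"):
    """Yield crash log paths in log_dir that match the signature.

    The default glob pattern is an assumption about file naming.
    """
    for path in sorted(Path(log_dir).glob(pattern)):
        text = path.read_text(errors="replace")
        if matches_suse_libc_crash(text):
            yield path
```

You would point `scan_crash_logs` at the log directory mentioned above (`$SPLUNK_HOME/var/log/splunk/`, expanded to a real path). A match only means the backtrace looks like the one in this thread; it is not a confirmed diagnosis.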

SplunkTrust

I would highly recommend that you pursue a support case for ANY splunkd crashes. You might get a suitable answer here from someone, but more likely your crash info is going to need to be evaluated by someone who has access to the source code to get more context around the backtrace above.

Explorer

Thanks. Will follow up with Splunk
