Linux - splunkd v4.1.4 crash with LDAP authentication enabled

scarteratwork
Explorer

Enabling LDAP - splunkd crash on startup.

  • Running Splunk standalone (i.e. not clustered as per previous post)
  • Splunk v4.1.4 (build 82143).
  • LDAP against Windows Server 2003 Active Directory. Server hit has a global catalog.
  • ldapsearch tests for both groups & users are successful as per splunk docs.
  • Have set groupBaseFilter to only include (cn=APP-Splunk*) groups (3 exist)
  • Have set userBaseFilter to only include my account (cn=myname)
  • splunkd_stderr.log says: src/tcmalloc.cc:353] Attempt to free invalid pointer: 0x1b00010
  • Last line in splunkd.log says: INFO loader - Instantiated plugin: thruputprocessor
  • Running on physical box with 8 cores & 16GB RAM. SLES 11 amd64.
  • Reverting to Splunk (internal) authentication allows Splunk to start cleanly.
  • Crash log output below.

Any ideas?

[build 82143]
Received fatal signal 6 (Aborted).
 Cause:
   Signal sent by PID 29447 running under UID 0.
 Crashing thread: Main Thread
 Registers:
    RIP:  [0x00007F38F5A9C645] gsignal + 53 (/lib64/libc.so.6)
    RDI:  [0x0000000000007307]
    RSI:  [0x0000000000007310]
    RBP:  [0x00007F38F5465F80]
    RSP:  [0x00007F38F5465AF8]
    RAX:  [0x0000000000000000]
    RBX:  [0x00007F38F5465C30]
    RCX:  [0xFFFFFFFFFFFFFFFF]
    RDX:  [0x0000000000000006]
    R8:  [0x00007F38F5B837C0]
    R9:  [0x2064696C61766E69]
    R10:  [0x0000000000000008]
    R11:  [0x0000000000000202]
    R12:  [0x0000000000F4BBA0]
    R13:  [0x0000000000000000]
    R14:  [0x0000000000000000]
    R15:  [0x0000000000001000]
    EFL:  [0x0000000000000202]
    TRAPNO:  [0x0000000000000000]
    ERR:  [0x0000000000000000]
    CSGSFS:  [0x0000000000000033]
    OLDMASK:  [0x0000000000000000]

 OS: Linux
 Arch: x86-64

 Backtrace:
  [0x00007F38F5A9DC33] abort + 387 (/lib64/libc.so.6)
  [0x0000000000AC36EF] ? (splunkd)
  [0x0000000000AC38A6] _ZN22TCMalloc_CrashReporter12PrintfAndDieEPKcz + 150 (splunkd)
  [0x0000000000ABC08B] _ZN123_GLOBAL__N__ZN61FLAG__namespace_do_not_use_directly_use_DECLARE_int64_instead43FLAGS_tcmalloc_large_alloc_report_thresholdE11InvalidFreeEPv + 43 (splunkd)
  [0x0000000000DD7D35] tc_free + 453 (splunkd)
  [0x00007F38F5B4A10D] __res_iclose + 189 (/lib64/libc.so.6)
  [0x00007F38F5B75234] ? (/lib64/libc.so.6)
  [0x00007F38F5B751C2] __libc_thread_freeres + 34 (/lib64/libc.so.6)
  [0x00007F38F7052083] ? (/lib64/libpthread.so.0)
  [0x00007F38F5B3D10D] clone + 109 (/lib64/libc.so.6)
 Linux / myserver / 2.6.27.45-0.1-default / #1 SMP 2010-02-22 16:49:47 +0100 / x86_64
 Last few lines of stderr (may contain info on assertion failure, but also could be old):
    src/tcmalloc.cc:353] Attempt to free invalid pointer: 0x1b00010

 /etc/SuSE-release: SUSE Linux Enterprise Server 11 (x86_64)
 glibc version: 2.9
 glibc release: stable
Threads running: 14
terminating...

mitch
Explorer

Hi. I finally have a good answer for your question.

Over the last several months we saw a slow trickle of reports of this crash, but we never had enough information to isolate it. What made it more frustrating is that it seemed to happen to just a few customers, and even for them it seemed to be hard to reproduce. Sometimes they would have splunk crash several times in a row, then the problem would suddenly disappear for no apparent reason.

Finally we had enough reports to piece together the common thread: all of the reports come from systems running 64-bit SuSE 11 of some sort. After a LOT of investigation we found out that it's due to a known bug in SuSE which Novell is planning to fix for OpenSuSE 11.4. They'll presumably fix it in a future SLES version as well.

The good news is that we have identified a workaround in splunk that lets us avoid this bug, and we will include it in all future versions of splunk (i.e. newer than "4.1.7", which is current as of this writing).

If this crash is happening often enough to cause you serious problems (and you can't wait for the next splunk release) you may want to get an early-access testing build from splunk support. Please reference bug "SPL-37331" so they know what issue you're referring to. Again, this is ONLY for 64-bit SuSE installs: no other OSes are affected by this issue.

mitch
Explorer

Jason -- of the reports that we've seen, several seem to have popped up when enabling LDAP auth. Other crash reports didn't involve LDAP at all. We've also successfully run LDAP on SuSE 11 with splunk 4.1.6 in-house, without problems.

So you're right -- the bug in SuSE's libc isn't related to LDAP. However, it does seem that using LDAP changes the timing of things to help provoke the crash for some environments.

Jason
Motivator

It does not have anything to do with AD authentication, as boxes I'm working with use Splunk standard auth.

Jason
Motivator

This bug evidently can also manifest itself as a crash on restart, so you may not notice it at first, but crash logs will accumulate in $SPLUNK_HOME/var/log/splunk/

jrodman
Splunk Employee

The telltales are __res_iclose and __libc_thread_freeres in the backtrace.
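
To check a pile of crash logs for those two frames, something like the following might help. It is only a rough sketch: the function names are made up for illustration, and the `crash-*.log` glob is an assumption about how the crash files are named in $SPLUNK_HOME/var/log/splunk/, so adjust it to whatever you actually see there.

```python
from pathlib import Path
from typing import List

# Frame names called out in this thread as the telltales of the SuSE libc bug.
TELLTALE_FRAMES = ("__res_iclose", "__libc_thread_freeres")


def has_telltale_frames(crash_text: str) -> bool:
    """Return True only if every telltale frame name appears in the text."""
    return all(frame in crash_text for frame in TELLTALE_FRAMES)


def find_matching_crash_logs(log_dir: str, pattern: str = "crash-*.log") -> List[Path]:
    """List crash logs in log_dir that contain all of the telltale frames.

    The default glob pattern is an assumption about crash log naming,
    not something confirmed in this thread.
    """
    matches = []
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps odd bytes in a log from aborting the scan.
        text = path.read_text(errors="replace")
        if has_telltale_frames(text):
            matches.append(path)
    return matches
```

Calling `find_matching_crash_logs("/opt/splunk/var/log/splunk")` (the path is just an example install location) would return the crash files worth attaching to a support case. A match only means the backtrace resembles the one posted above; it does not confirm the root cause.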

dwaddle
SplunkTrust

I would highly recommend that you pursue a support case for ANY splunkd crashes. You might get a suitable answer here from someone, but more likely your crash info will need to be evaluated by someone who has access to the source code to get more context around the backtrace above.

scarteratwork
Explorer

Thanks. Will follow up with Splunk
