Monitoring Splunk

Why does my Splunk server keep crashing?

dmitri47
Engager

-bash-4.1$ cat crash-2018-05-21-09:41:12.log
[build fa31da744b51] 2018-05-21 09:41:12
Received fatal signal 6 (Aborted).
Cause:
Signal sent by PID 12969 running under UID 18002.
Crashing thread: DistributedSearchResultCollectorThread
Registers:
RIP: [0x00007FA78E16B495] gsignal + 53 (libc.so.6 + 0x32495)
RDI: [0x00000000000032A9]
RSI: [0x00000000000032C9]
RBP: [0x00007FA7916AEC30]
RSP: [0x00007FA78B5FEA08]
RAX: [0x0000000000000000]
RBX: [0x00007FA7887FD000]
RCX: [0xFFFFFFFFFFFFFFFF]
RDX: [0x0000000000000006]
R8: [0x0000000000000200]
R9: [0xFEFEFEFEFEFEFEFF]
R10: [0x0000000000000008]
R11: [0x0000000000000206]
R12: [0x00007FA7915F37C6]
R13: [0x00007FA791793680]
R14: [0x00007FA78B688010]
R15: [0x00007FA78B5FED10]
EFL: [0x0000000000000206]
TRAPNO: [0x0000000000000000]
ERR: [0x0000000000000000]
CSGSFS: [0x0000000000000033]
OLDMASK: [0x0000000000000000]

OS: Linux
Arch: x86-64

Backtrace (PIC build):
[0x00007FA78E16B495] gsignal + 53 (libc.so.6 + 0x32495)
[0x00007FA78E16CC75] abort + 373 (libc.so.6 + 0x33C75)
[0x00007FA78E16460E] ? (libc.so.6 + 0x2B60E)
[0x00007FA78E1646D0] __assert_perror_fail + 0 (libc.so.6 + 0x2B6D0)
[0x00007FA7909B0E6F] _ZN9EventLoop3addEP8PolledFd18PollableDescriptorj + 591 (splunkd + 0x1251E6F)
[0x00007FA7909B2BAE] _ZN19InThreadActorNotifyC2EP9EventLoop + 46 (splunkd + 0x1253BAE)
[0x00007FA7909B2E50] _ZN9EventLoop3runEv + 96 (splunkd + 0x1253E50)
[0x00007FA790A6DAF0] _ZN15TcpOutboundLoop3runEv + 16 (splunkd + 0x130EAF0)
[0x00007FA78FFBFF05] _ZN21EventLoopRunnerThread4mainEv + 37 (splunkd + 0x860F05)
[0x00007FA790A6EB1F] _ZN6Thread8callMainEPv + 111 (splunkd + 0x130FB1F)
[0x00007FA78E4D4AA1] ? (libpthread.so.0 + 0x7AA1)
[0x00007FA78E221BCD] clone + 109 (libc.so.6 + 0xE8BCD)
Linux / / 2.6.32-696.28.1.el6.x86_64 / #1 SMP Thu Apr 26 04:27:41 EDT 2018 / x86_64
Last few lines of stderr (may contain info on assertion failure, but also could be old):
2018-05-21 07:50:44.714 -0400 splunkd started (build fa31da744b51)
2018-05-21 08:05:58.776 -0400 splunkd started (build fa31da744b51)
splunkd: /home/build/build-src/minty/src/util/EventLoop.cpp:843: void EventLoop::add(PolledFd*, PollableDescriptor, events_mask_t): Assertion `fd.valid()' failed.
2018-05-21 08:15:40.920 -0400 splunkd started (build fa31da744b51)
2018-05-21 08:30:58.927 -0400 splunkd started (build fa31da744b51)
2018-05-21 08:40:36.969 -0400 splunkd started (build fa31da744b51)
2018-05-21 08:50:37.156 -0400 splunkd started (build fa31da744b51)
2018-05-21 09:05:55.191 -0400 splunkd started (build fa31da744b51)
2018-05-21 09:25:37.188 -0400 splunkd started (build fa31da744b51)
2018-05-21 09:35:45.231 -0400 splunkd started (build fa31da744b51)

/etc/redhat-release: Red Hat Enterprise Linux Server release 6.9 (Santiago)
glibc version: 2.12
glibc release: stable
Last errno: 23
Threads running: 16
Runtime: 327.200553s
argv: [splunkd -p 8089 restart]
Process renamed: [splunkd pid=6741] splunkd -p 8089 restart [process-runner]
Process renamed: [splunkd pid=6741] search --id=scheduler_nobodyf5_RMD54f4818d5a227023d_at_1526910000_56 --maxbuckets=0 --ttl=600 --maxout=500000 --maxtime=8640000 --lookups=1 --reduce_freq=10 --user=splunk-system-user --pro --roles=admin:splunk-system-role

Regex JIT disabled due to SELinux

using CLOCK_MONOTONIC
Preforked process=0/65: process_runtime_msec=606, search=0/124, search_runtime_msec=592, new_user=N, export_search=N, args_size=256, completed_searches=0, user_changes=0, cache_rotations=0

Thread: "DistributedSearchResultCollectorThread", did_join=0, ready_to_run=Y, main_thread=N
First 8 bytes of Thread token @0x7fa78ab1ce10:
00000000 00 f7 5f 8b a7 7f 00 00 |.._.....|
00000008

x86 CPUID registers:
0: 0000000D 756E6547 6C65746E 49656E69
1: 000206D2 0A040800 9E982203 1F8BFBFF
2: 76036301 00F0B5FF 00000000 00C10000
3: 00000000 00000000 00000000 00000000
4: 00000000 00000000 00000000 00000000
5: 00000000 00000000 00000000 00000000
6: 00000077 00000002 00000009 00000000
7: 00000000 00000000 00000000 00000000
8: 00000000 00000000 00000000 00000000
9: 00000000 00000000 00000000 00000000
A: 07300401 0000007F 00000000 00000000
B: 00000000 00000000 000000CD 0000000A
C: 00000000 00000000 00000000 00000000
D: 00000000 00000000 00000000 00000000
80000000: 80000008 00000000 00000000 00000000
80000001: 00000000 00000000 00000001 28100800
80000002: 65746E49 2952286C 6F655820 2952286E
80000003: 55504320 2D354520 37383632 33762057
80000004: 33204020 4730312E 00007A48 00000000
80000005: 00000000 00000000 00000000 00000000
80000006: 00000000 00000000 01006040 00000000
80000007: 00000000 00000000 00000000 00000100
80000008: 00003028 00000000 00000000 00000000
terminating...
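
In case it helps anyone reading the dump: the lines that matter most are the crashing thread (DistributedSearchResultCollectorThread) and the assertion in stderr (EventLoop.cpp:843: Assertion `fd.valid()' failed). A rough Python sketch for pulling those fields out of every crash log, assuming the default log path (adjust for your $SPLUNK_HOME):

# Rough sketch only: summarize crash-*.log files by build, crashing thread, and assertion.
import glob
import re

LOG_DIR = "/opt/splunk/var/log/splunk"  # assumed default; adjust for your install

for path in sorted(glob.glob(LOG_DIR + "/crash-*.log")):
    text = open(path, errors="replace").read()
    build = re.search(r"\[build ([0-9a-f]+)\]", text)
    thread = re.search(r"Crashing thread: (\S+)", text)
    failed = re.search(r"Assertion `([^']+)' failed", text)
    print(path,
          build.group(1) if build else "?",
          thread.group(1) if thread else "?",
          failed.group(1) if failed else "no assertion recorded")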

0 Karma

dmitri47
Engager

https://answers.splunk.com/answers/290645/why-is-our-splunk-624-forwarder-on-linux-crashing.html

Quoting the answer from that thread:

Splunk 6.2.4 seems to have introduced a bug that causes splunkd to crash when a monitor input watches files that may be deleted (perhaps too quickly?).

I see in your output that the crash is related to the Nmon Performance Monitor:

WatchedTailFile-WatchedFileState: path="/opt/splunkforwarder/var/run/nmon/var/csv_repository/Dymas_24_JUL_2015_053319_FILE_444882_20150724070843.nmon.csv", flags=0x24003
The crash is not directly caused by the Nmon app. Until recently, the processing steps created the csv files in the same directory that Splunk watches, and in some cases empty files could be created and then deleted by the nmon2csv converters, which causes splunkd 6.2.4 to crash (this is totally unexpected and wasn't the case before).

On 5 August 2014 I released a hotfix with a workaround for this: files are now written in a working directory and then moved to the final directory that Splunk watches, which avoids the issue in splunkd.

Please update to Nmon Perf Monitor 1.6.04 and your problem will be solved.
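
That workaround boils down to something like the sketch below (illustrative only, not the actual nmon2csv code; the working-directory path is a placeholder): build the csv outside the monitored directory and only move it in once it is complete.

# Illustrative sketch: write the file outside the monitored directory, then move it in,
# so the monitor never sees an empty or half-written file.
import os
import shutil

WORK_DIR = "/opt/splunkforwarder/var/run/nmon/var/csv_workingdir"   # placeholder working dir
FINAL_DIR = "/opt/splunkforwarder/var/run/nmon/var/csv_repository"  # directory Splunk watches

def publish_csv(name, rows):
    os.makedirs(WORK_DIR, exist_ok=True)
    os.makedirs(FINAL_DIR, exist_ok=True)
    tmp_path = os.path.join(WORK_DIR, name)
    with open(tmp_path, "w") as f:
        f.writelines(rows)
    # The move is atomic when both directories sit on the same filesystem,
    # so the watched file appears fully written or not at all.
    shutil.move(tmp_path, os.path.join(FINAL_DIR, name))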

0 Karma

dmitri47
Engager

Seems we had this issue back in 6.2.4, it was fixed since then, and it's somehow broken again with 7.x... SMH

0 Karma

lguinn2
Legend

What version of Splunk are you running, and what is the size of your environment?

0 Karma

dmitri47
Engager

This first started when I upgraded from Splunk 7.0.3 to 7.1.

After noticing that this one server (a Splunk search head) was crashing every 3-5 minutes, I downgraded that whole environment from 7.1 to 7.0.3 (8-9 servers total).

It was fine for 1-2 days and is now crashing all of the time.
Other enclaves work just fine, so Splunk 7.1 itself seems stable.

Size: 1 cluster master (CM), 2 indexers, 4 search heads, and up to 150 Splunk forwarders.

0 Karma

ddrillic
Ultra Champion

I would go to Support ...

dmitri47
Engager

Yeah... Will post if and when I get a resolution. Google shows that a bunch of other users have had this issue. Why can't Splunk fix it?

0 Karma

tkrishnan
Explorer

@dmitri47 did you get anywhere with this one? Have you heard anything from Support?

0 Karma

SithLord
Explorer

Soooo... There was a known issue in Splunk 7.0.3, and upgrading to Splunk 7.1.1 fixed it.
It has been working well since.

0 Karma

tkrishnan
Explorer

Thanks for the super quick answer. We have the same issue with 6.6.3. Did you find an issue number or something for this one so I can trace it back to my version's known-issues documentation?

0 Karma

SithLord
Explorer

I would look here:

http://docs.splunk.com/Documentation/Splunk/7.1.1/ReleaseNotes/Fixedissues

Fixed issues:

2018-05-18 SPL-154138, SPL-154542, SPL-154544, FAST-9662 Searches with multikv extraction use too much memory: potentially orders of magnitude more than previous versions.
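
If you need to confirm exactly which version/build each box is on before matching against the Fixed Issues / Known Issues pages, the management port exposes it. A minimal sketch, assuming the usual port 8089 and placeholder hostnames/credentials:

# Minimal sketch: print version and build for each instance via the server/info REST endpoint.
import requests

HOSTS = ["searchhead1.example.com", "indexer1.example.com"]  # placeholders

for host in HOSTS:
    r = requests.get(
        "https://%s:8089/services/server/info" % host,
        params={"output_mode": "json"},
        auth=("admin", "changeme"),  # placeholder credentials
        verify=False,                # management port often uses a self-signed cert
    )
    info = r.json()["entry"][0]["content"]
    print(host, info["version"], info["build"])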

solarboyz1
Builder

It appears you have SELinux enabled; have you followed this?
https://github.com/doksu/selinux_policy_for_splunk
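
A quick way to check whether SELinux is actually getting in splunkd's way on that one box is to look at its mode and at recent AVC denials. A minimal sketch, assuming getenforce and ausearch (audit package) are available and it runs as root:

# Minimal sketch: report SELinux mode and any recent AVC denials involving splunkd.
import subprocess

mode = subprocess.run(["getenforce"], capture_output=True, text=True).stdout.strip()
print("SELinux mode:", mode)  # Enforcing / Permissive / Disabled

denials = subprocess.run(
    ["ausearch", "-m", "avc", "-ts", "recent", "-c", "splunkd"],
    capture_output=True, text=True,
)
print(denials.stdout or "no recent AVC denials for splunkd")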

0 Karma

dmitri47
Engager

We have 4 enclaves and Splunk on all 4. All have the same SE Linux set, but issues only on 1 server.

0 Karma