<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Segfault errors on an indexer in cluster in Deployment Architecture</title>
    <link>https://community.splunk.com/t5/Deployment-Architecture/Segfault-errors-on-a-indexer-in-cluster/m-p/422924#M15075</link>
    <description>&lt;P&gt;We hit this issue very frequently; it appears to have started right after the last upgrade.&lt;BR /&gt;
The kernel logs below show the frequency. The main splunkd process on the indexer keeps running without restarting, so the crashes appear to come from search processes.&lt;/P&gt;

&lt;HR /&gt;

&lt;P&gt;Linux splunkindexer1  2.6.32-754.9.1.el6.x86_64 #1 SMP Wed Dec 21 10:08:21 PST 2018 x86_64 x86_64 x86_64 GNU/Linux&lt;BR /&gt;
-bash-4.1$ cat /var/log/messages | grep -i kernel| tail&lt;BR /&gt;
Jul 31 08:16:24 splunkindexer1 kernel: splunkd[3149]: segfault at 7ff425810057 ip 000055ad21554260 sp 00007ff4047f8068 error 4 in splunkd[55ad1f3d2000+2e2b000]&lt;BR /&gt;
Jul 31 08:19:34 splunkindexer1 kernel: splunkd[7907]: segfault at 7ff42540e057 ip 000055ad21554260 sp 00007ff4043f6068 error 4 in splunkd[55ad1f3d2000+2e2b000]&lt;BR /&gt;
Jul 31 08:20:30 splunkindexer1 kernel: splunkd[22411]: segfault at 7ff42560f057 ip 000055ad21554260 sp 00007ff4045f7068 error 4 in splunkd[55ad1f3d2000+2e2b000]&lt;BR /&gt;
Jul 31 08:21:07 splunkindexer1 kernel: splunkd[30162]: segfault at 7ff42580f057 ip 000055ad21554260 sp 00007ff4047f7068 error 4 in splunkd[55ad1f3d2000+2e2b000]&lt;BR /&gt;
Jul 31 08:51:34 splunkindexer1 kernel: splunkd[4092]: segfault at 7ff4224104f7 ip 000055ad21554260 sp 00007ff4013f8508 error 4 in splunkd[55ad1f3d2000+2e2b000]&lt;/P&gt;

&lt;P&gt;This is from one of the crash logs.&lt;/P&gt;

&lt;HR /&gt;

&lt;P&gt;Received fatal signal 11 (Segmentation fault).&lt;BR /&gt;
 Cause:&lt;BR /&gt;
   No memory mapped at address [0x00000261CB7ECF].&lt;BR /&gt;
 Crashing thread: BatchSearch&lt;BR /&gt;
 "SNIP"&lt;/P&gt;

&lt;P&gt;Backtrace (PIC build):&lt;BR /&gt;
  [0x000056345C300260] st_decode_from_vbe + 0 (splunkd + 0x2182260)&lt;BR /&gt;
  [0x000056345C2EC4DA] ? (splunkd + 0x216E4DA)&lt;BR /&gt;
  [0x000056345C2EC7EF] _seek + 143 (splunkd + 0x216E7EF)&lt;BR /&gt;
  [0x000056345C2EF4A9] and_literals + 713 (splunkd + 0x21714A9)&lt;BR /&gt;
  [0x000056345C2F3316] ? (splunkd + 0x2175316)&lt;BR /&gt;
"SNIP" &lt;/P&gt;

&lt;P&gt;Last errno: 2&lt;BR /&gt;
Threads running: 11&lt;BR /&gt;
Runtime: 52652.730678s&lt;BR /&gt;
argv: [splunkd -p 8089 restart splunkd]&lt;BR /&gt;
Process renamed: [splunkd pid=3960] splunkd -p 8089 restart splunkd [process-runner]&lt;/P&gt;

&lt;P&gt;Process renamed: [splunkd pid=3960] search --id=remote_sh1_scheduler__d5331__search__RMD561462962f68d150_at_1562933700_3076_AAAAAAAA-1111-2222-AAAA-ADAAA6256C5C --maxbuckets=0 --ttl=60 --maxout=0 --maxtime=0 --lookups=1 --streaming --sidtype=normal --outCsv=true --acceptSrsLevel=1 --user=d5331 --pro --roles=power:user&lt;/P&gt;</description>
    <pubDate>Wed, 30 Sep 2020 01:33:50 GMT</pubDate>
    <dc:creator>sylim_splunk</dc:creator>
    <dc:date>2020-09-30T01:33:50Z</dc:date>
    <item>
      <title>Segfault errors on an indexer in cluster</title>
      <link>https://community.splunk.com/t5/Deployment-Architecture/Segfault-errors-on-a-indexer-in-cluster/m-p/422924#M15075</link>
      <description>&lt;P&gt;We hit this issue very frequently; it appears to have started right after the last upgrade.&lt;BR /&gt;
The kernel logs below show the frequency. The main splunkd process on the indexer keeps running without restarting, so the crashes appear to come from search processes.&lt;/P&gt;

&lt;HR /&gt;

&lt;P&gt;Linux splunkindexer1  2.6.32-754.9.1.el6.x86_64 #1 SMP Wed Dec 21 10:08:21 PST 2018 x86_64 x86_64 x86_64 GNU/Linux&lt;BR /&gt;
-bash-4.1$ cat /var/log/messages | grep -i kernel| tail&lt;BR /&gt;
Jul 31 08:16:24 splunkindexer1 kernel: splunkd[3149]: segfault at 7ff425810057 ip 000055ad21554260 sp 00007ff4047f8068 error 4 in splunkd[55ad1f3d2000+2e2b000]&lt;BR /&gt;
Jul 31 08:19:34 splunkindexer1 kernel: splunkd[7907]: segfault at 7ff42540e057 ip 000055ad21554260 sp 00007ff4043f6068 error 4 in splunkd[55ad1f3d2000+2e2b000]&lt;BR /&gt;
Jul 31 08:20:30 splunkindexer1 kernel: splunkd[22411]: segfault at 7ff42560f057 ip 000055ad21554260 sp 00007ff4045f7068 error 4 in splunkd[55ad1f3d2000+2e2b000]&lt;BR /&gt;
Jul 31 08:21:07 splunkindexer1 kernel: splunkd[30162]: segfault at 7ff42580f057 ip 000055ad21554260 sp 00007ff4047f7068 error 4 in splunkd[55ad1f3d2000+2e2b000]&lt;BR /&gt;
Jul 31 08:51:34 splunkindexer1 kernel: splunkd[4092]: segfault at 7ff4224104f7 ip 000055ad21554260 sp 00007ff4013f8508 error 4 in splunkd[55ad1f3d2000+2e2b000]&lt;/P&gt;

&lt;P&gt;This is from one of the crash logs.&lt;/P&gt;

&lt;HR /&gt;

&lt;P&gt;Received fatal signal 11 (Segmentation fault).&lt;BR /&gt;
 Cause:&lt;BR /&gt;
   No memory mapped at address [0x00000261CB7ECF].&lt;BR /&gt;
 Crashing thread: BatchSearch&lt;BR /&gt;
 "SNIP"&lt;/P&gt;

&lt;P&gt;Backtrace (PIC build):&lt;BR /&gt;
  [0x000056345C300260] st_decode_from_vbe + 0 (splunkd + 0x2182260)&lt;BR /&gt;
  [0x000056345C2EC4DA] ? (splunkd + 0x216E4DA)&lt;BR /&gt;
  [0x000056345C2EC7EF] _seek + 143 (splunkd + 0x216E7EF)&lt;BR /&gt;
  [0x000056345C2EF4A9] and_literals + 713 (splunkd + 0x21714A9)&lt;BR /&gt;
  [0x000056345C2F3316] ? (splunkd + 0x2175316)&lt;BR /&gt;
"SNIP" &lt;/P&gt;

&lt;P&gt;Last errno: 2&lt;BR /&gt;
Threads running: 11&lt;BR /&gt;
Runtime: 52652.730678s&lt;BR /&gt;
argv: [splunkd -p 8089 restart splunkd]&lt;BR /&gt;
Process renamed: [splunkd pid=3960] splunkd -p 8089 restart splunkd [process-runner]&lt;/P&gt;

&lt;P&gt;Process renamed: [splunkd pid=3960] search --id=remote_sh1_scheduler__d5331__search__RMD561462962f68d150_at_1562933700_3076_AAAAAAAA-1111-2222-AAAA-ADAAA6256C5C --maxbuckets=0 --ttl=60 --maxout=0 --maxtime=0 --lookups=1 --streaming --sidtype=normal --outCsv=true --acceptSrsLevel=1 --user=d5331 --pro --roles=power:user&lt;/P&gt;</description>
      <pubDate>Wed, 30 Sep 2020 01:33:50 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Deployment-Architecture/Segfault-errors-on-a-indexer-in-cluster/m-p/422924#M15075</guid>
      <dc:creator>sylim_splunk</dc:creator>
      <dc:date>2020-09-30T01:33:50Z</dc:date>
    </item>
    <item>
      <title>Re: Segfault errors on an indexer in cluster</title>
      <link>https://community.splunk.com/t5/Deployment-Architecture/Segfault-errors-on-a-indexer-in-cluster/m-p/422925#M15076</link>
      <description>&lt;P&gt;This could be caused by corrupted buckets crashing the search processes that run against them.&lt;BR /&gt;
You may want to repair the buckets and rerun the same search to see whether that resolves it.&lt;BR /&gt;
Follow the steps below to get the list of buckets suspected of corruption.&lt;/P&gt;

&lt;P&gt;*** How to get the list of corrupt buckets ***&lt;BR /&gt;
1. On the indexer, cd to $SPLUNK_HOME/var/log/splunk&lt;BR /&gt;
2. Run the command below:&lt;BR /&gt;
$ grep "MAP:" crash-2019-07-31*.log | grep "/opt/splunk/storage"&lt;BR /&gt;
    "/opt/splunk/storage" varies with your deployment setup and is taken from MAP lines like the one below in the crash log.&lt;BR /&gt;
    &lt;EM&gt;crash-2019-07-31-00:15:17.log: &lt;BR /&gt;
MAP: 7f00e9cdb000-7f00ea000000 r--s 00000000 fd:03 563872524                  /opt/splunk/storage/hot/myindex1/rb_1560184689_1559942722_7530_AAAAAAAA-BBBB-1111-8C82-ABAD1EDD033D/1560184689-1560184620-11473276039248555956.tsidx&lt;/EM&gt;&lt;BR /&gt;
3. This returns the problematic buckets. In the example above, the bucket location is &lt;EM&gt;/opt/splunk/storage/hot/myindex1/rb_1560184689_1559942722_7530_AAAAAAAA-BBBB-1111-8C82-ABAD1EDD033D&lt;/EM&gt;. A scripted version of these steps is sketched below.&lt;/P&gt;
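
&lt;P&gt;A minimal scripted sketch of the steps above, assuming the crash logs live under $SPLUNK_HOME/var/log/splunk and the storage root is /opt/splunk/storage (both are deployment-specific; buckets.txt is a name chosen here so the repair step below can reuse the list):&lt;/P&gt;

&lt;PRE&gt;#!/bin/bash
# Sketch: collect the unique bucket directories referenced by MAP: lines in
# the crash logs. STORAGE_ROOT is an assumption; replace it with the volume
# path shown in the MAP: lines of your own crash logs.
STORAGE_ROOT="/opt/splunk/storage"
cd "$SPLUNK_HOME/var/log/splunk" || exit 1

# Each MAP: line ends with the mapped .tsidx file; strip the file name to
# get the bucket directory, then de-duplicate.
grep -h "MAP:" crash-*.log \
  | grep -o "${STORAGE_ROOT}[^ ]*" \
  | xargs -r -n1 dirname \
  | sort -u &gt; buckets.txt&lt;/PRE&gt;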

&lt;P&gt;*** How to fix the corrupted buckets ***&lt;BR /&gt;
Rebuilding the buckets with fsck should fix the problem. Follow these steps (a scripted sketch follows the list):&lt;BR /&gt;
0. On the cluster master, run splunk enable maintenance-mode&lt;BR /&gt;
1. On the indexer, run splunk offline&lt;BR /&gt;
2. On the indexer, for each bucket found above, run splunk fsck repair --one-bucket --bucket-path="path_from_above"&lt;BR /&gt;
e.g.:&lt;BR /&gt;
&lt;EM&gt;splunk fsck repair --one-bucket --bucket-path=/opt/splunk/storage/hot/myindex1/rb_1560184689_1559942722_7530_AAAAAAAA-BBBB-1111-8C82-ABAD1EDD033D&lt;/EM&gt;&lt;BR /&gt;
3. On the indexer, run splunk start&lt;BR /&gt;
4. On the cluster master, run splunk disable maintenance-mode&lt;/P&gt;
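
&lt;P&gt;A sketch of steps 1-3, assuming the bucket paths were saved to buckets.txt by the previous step and that the splunk binary is on the indexer's PATH; steps 0 and 4 still run on the cluster master:&lt;/P&gt;

&lt;PRE&gt;#!/bin/bash
# Step 0 (on the cluster master): splunk enable maintenance-mode

# Steps 1-3 run on the affected indexer.
splunk offline

# Repair every suspected bucket collected earlier (one path per line).
while IFS= read -r bucket; do
    splunk fsck repair --one-bucket --bucket-path="$bucket"
done &lt; buckets.txt

splunk start

# Step 4 (on the cluster master): splunk disable maintenance-mode&lt;/PRE&gt;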

&lt;P&gt;If this does not improve the situation, please contact Splunk Support with details of your deployment architecture and a diag from the indexer.&lt;/P&gt;</description>
      <pubDate>Wed, 30 Sep 2020 01:33:53 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Deployment-Architecture/Segfault-errors-on-a-indexer-in-cluster/m-p/422925#M15076</guid>
      <dc:creator>sylim_splunk</dc:creator>
      <dc:date>2020-09-30T01:33:53Z</dc:date>
    </item>
    <item>
      <title>Re: Segfault errors on an indexer in cluster</title>
      <link>https://community.splunk.com/t5/Deployment-Architecture/Segfault-errors-on-a-indexer-in-cluster/m-p/422926#M15077</link>
      <description>&lt;P&gt;That error means a process (splunkd) attempted to access memory that is not mapped into its address space. I believe this is/was a known bug in Splunk 7.1.x and below.&lt;/P&gt;
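
&lt;P&gt;For what it's worth, the "error 4" field in the kernel lines above is the x86 page-fault error-code bitmask; a small sketch to decode it (the default value 4 is taken from the logs above):&lt;/P&gt;

&lt;PRE&gt;#!/bin/bash
# Decode the x86 page-fault "error N" field from a kernel segfault line:
#   bit 0: 1 = protection violation, 0 = page not present
#   bit 1: 1 = write access,         0 = read access
#   bit 2: 1 = user-mode fault,      0 = kernel-mode fault
err=${1:-4}
(( err &amp; 1 )) &amp;&amp; echo "protection violation" || echo "page not present"
(( err &amp; 2 )) &amp;&amp; echo "write access"         || echo "read access"
(( err &amp; 4 )) &amp;&amp; echo "user-mode fault"      || echo "kernel-mode fault"
# error 4 =&gt; a user-mode read of an unmapped page, matching the
# "No memory mapped at address" cause in the crash log.&lt;/PRE&gt;</description>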
      <pubDate>Wed, 31 Jul 2019 17:23:17 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Deployment-Architecture/Segfault-errors-on-a-indexer-in-cluster/m-p/422926#M15077</guid>
      <dc:creator>codebuilder</dc:creator>
      <dc:date>2019-07-31T17:23:17Z</dc:date>
    </item>
    <item>
      <title>Re: Segfault errors on an indexer in cluster</title>
      <link>https://community.splunk.com/t5/Deployment-Architecture/Segfault-errors-on-a-indexer-in-cluster/m-p/422927#M15078</link>
      <description>&lt;P&gt;Known issue (SPL-153976), fixed as of 7.1.3.&lt;/P&gt;</description>
      <pubDate>Wed, 31 Jul 2019 17:26:41 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Deployment-Architecture/Segfault-errors-on-a-indexer-in-cluster/m-p/422927#M15078</guid>
      <dc:creator>codebuilder</dc:creator>
      <dc:date>2019-07-31T17:26:41Z</dc:date>
    </item>
  </channel>
</rss>

