Deployment Architecture

Segfault errors on an indexer in a cluster

sylim_splunk
Splunk Employee

We are seeing this issue very frequently, and it appears to have started right after the last upgrade.
The kernel logs below show the frequency. The splunkd process on the indexer keeps running without restarting, so the crashes appear to come from search processes.


Linux splunkindexer1 2.6.32-754.9.1.el6.x86_64 #1 SMP Wed Dec 21 10:08:21 PST 2018 x86_64 x86_64 x86_64 GNU/Linux
-bash-4.1$ cat /var/log/messages | grep -i kernel| tail
Jul 31 08:16:24 splunkindexer1 kernel: splunkd[3149]: segfault at 7ff425810057 ip 000055ad21554260 sp 00007ff4047f8068 error 4 in splunkd[55ad1f3d2000+2e2b000]
Jul 31 08:19:34 splunkindexer1 kernel: splunkd[7907]: segfault at 7ff42540e057 ip 000055ad21554260 sp 00007ff4043f6068 error 4 in splunkd[55ad1f3d2000+2e2b000]
Jul 31 08:20:30 splunkindexer1 kernel: splunkd[22411]: segfault at 7ff42560f057 ip 000055ad21554260 sp 00007ff4045f7068 error 4 in splunkd[55ad1f3d2000+2e2b000]
Jul 31 08:21:07 splunkindexer1 kernel: splunkd[30162]: segfault at 7ff42580f057 ip 000055ad21554260 sp 00007ff4047f7068 error 4 in splunkd[55ad1f3d2000+2e2b000]

Jul 31 08:51:34 splunkindexer1 kernel: splunkd[4092]: segfault at 7ff4224104f7 ip 000055ad21554260 sp 00007ff4013f8508 error 4 in splunkd[55ad1f3d2000+2e2b000]
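As a rough way to gauge how often this happens (a sketch only, assuming the standard syslog layout of /var/log/messages shown above), you can count the splunkd segfault lines per hour:

$ grep -i "splunkd.*segfault" /var/log/messages | awk '{print $1, $2, substr($3,1,2)":00"}' | sort | uniq -c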

This is from one of the crash logs.


Received fatal signal 11 (Segmentation fault).
Cause:
No memory mapped at address [0x00000261CB7ECF].
Crashing thread: BatchSearch
"SNIP"

Backtrace (PIC build):
[0x000056345C300260] st_decode_from_vbe + 0 (splunkd + 0x2182260)
[0x000056345C2EC4DA] ? (splunkd + 0x216E4DA)
[0x000056345C2EC7EF] _seek + 143 (splunkd + 0x216E7EF)
[0x000056345C2EF4A9] and_literals + 713 (splunkd + 0x21714A9)
[0x000056345C2F3316] ? (splunkd + 0x2175316)
"SNIP"

Last errno: 2
Threads running: 11
Runtime: 52652.730678s
argv: [splunkd -p 8089 restart splunkd]
Process renamed: [splunkd pid=3960] splunkd -p 8089 restart splunkd [process-runner]

Process renamed: [splunkd pid=3960] search --id=remote_sh1_scheduler_d5331search_RMD561462962f68d150_at_1562933700_3076_AAAAAAAA-1111-2222-AAAA-ADAAA6256C5C --maxbuckets=0 --ttl=60 --maxout=0 --maxtime=0 --lookups=1 --streaming --sidtype=normal --outCsv=true --acceptSrsLevel=1 --user=d5331 --pro --roles=power:user
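To check whether the same scheduled search is involved each time, one option (a sketch, assuming the crash logs are in the default $SPLUNK_HOME/var/log/splunk location) is to pull the crashing thread and the renamed process line out of every crash log:

$ cd $SPLUNK_HOME/var/log/splunk
$ grep -H "Crashing thread" crash-*.log
$ grep -H "Process renamed" crash-*.log | grep "search --id"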

1 Solution

sylim_splunk
Splunk Employee

This could have been caused by corrupted buckets being hit when searches run against them.
You may want to repair the buckets and then rerun the same search to see whether that fixes it.
Follow the steps below to get the list of buckets suspected of being corrupt.

*** How to get the list of corrupt buckets ***
1. On the indexer, cd to $SPLUNK_HOME/var/log/splunk.
2. Run the command below:
$ grep "MAP:" crash-2019-07-31*.log | grep "/opt/splunk/storage"
"/opt/splunk/storage" varies according to your deployment setup; it is taken from a line like the one below in the crash log.
crash-2019-07-31-00:15:17.log:
MAP: 7f00e9cdb000-7f00ea000000 r--s 00000000 fd:03 563872524 /opt/splunk/storage/hot/myindex1/rb_1560184689_1559942722_7530_AAAAAAAA-BBBB-1111-8C82-ABAD1EDD033D/1560184689-1560184620-11473276039248555956.tsidx

3. This returns the problematic buckets. In the example above, the bucket location is /opt/splunk/storage/hot/myindex1/rb_1560184689_1559942722_7530_AAAAAAAA-BBBB-1111-8C82-ABAD1EDD033D (see the one-liner sketch below for collecting unique bucket directories).
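If there are many MAP lines, a minimal sketch for collapsing them to unique bucket directories (assuming the /opt/splunk/storage prefix shown above; adjust it to your own storage path) is:

$ grep "MAP:" crash-2019-07-31*.log | grep -o "/opt/splunk/storage[^ ]*" | xargs -n1 dirname | sort -u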

*** How to fix the corrupted buckets ***
Rebuilding the buckets with fsck should fix the problem. Follow these steps to rebuild them:
0. On the cluster master (CM): splunk enable maintenance-mode
1. On the indexer: splunk offline
2. On the indexer, for each bucket found above, run splunk fsck repair --one-bucket --bucket-path="path_from_above" (a loop sketch for multiple buckets follows these steps), e.g.:
splunk fsck repair --one-bucket --bucket-path=/opt/splunk/storage/hot/myindex1/rb_1560184689_1559942722_7530_AAAAAAAA-BBBB-1111-8C82-ABAD1EDD033D
3. On the indexer: splunk start
4. On the cluster master (CM): splunk disable maintenance-mode
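
If several buckets are listed, a minimal loop sketch (assuming a hypothetical file buckets.txt holding one bucket path per line, built from the list gathered above, and that the indexer is already offline per step 1) would be:

while read -r bucket; do
  splunk fsck repair --one-bucket --bucket-path="$bucket"
done < buckets.txt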

If this does not improve the situation, please contact Splunk Support with details of the deployment architecture and a diag from the indexer.
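
A diag can be generated on the indexer with the standard diag command (assuming $SPLUNK_HOME points at your Splunk installation):

$ $SPLUNK_HOME/bin/splunk diag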


codebuilder
SplunkTrust

That error means that a process (splunkd) has attempted to access memory that is not assigned to it. I believe this is/was a known bug in Splunk 7.1.x and below.

----
An upvote would be appreciated and Accept Solution if it helps!

codebuilder
SplunkTrust

This is a known issue (SPL-153976) and was fixed as part of 7.1.3.
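
To confirm the version the indexer is actually running (assuming $SPLUNK_HOME points at your Splunk installation), you can check with:

$ $SPLUNK_HOME/bin/splunk version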

----
An upvote would be appreciated and Accept Solution if it helps!

