"SbrkSysAllocator failed" while performing splunk fsck

hodsonc
Explorer

I just got this error while running fsck.

I upgraded to 4.3, and after upgrading the indexer, it told me to run an fsck on the indexes. After shutting down Splunk and starting the fsck, I got the above error.

  $ /opt/splunk/bin/splunk fsck --mode metadata --all --repair
bucket=/opt/splunk/var/lib/splunk/_internaldb/db/db_1327351718_1327338541_3791 file='/opt/splunk/var/lib/splunk/_internaldb/db/db_1327351718_1327338541_3791/Sources.data' code=16 contains recover-padding, repairing...
src/system-alloc.cc:423] SbrkSysAllocator failed.

Other buckets then continue to be processed.

More info:

  • I have now done this on 5 of our indexers and it has happened on each one.
  • One of the fsck processes was killed (not sure why right now), and when I restarted the fsck it gave the above error on the first index it started with (a different index than in the first run).

All of these boxes are the same:

$ free -m
             total       used       free     shared    buffers     cached
Mem:         16041      14974       1067          0        565      12716
-/+ buffers/cache:       1692      14349
Swap:         4000          6       3994

hexx
Splunk Employee

Looking at your free -m output, we can see that very little physical memory remains free on your server. Oddly enough, most of it (almost 13GB of your 16GB total) appears to be held by the operating system as cache.
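To make the arithmetic concrete, here is a minimal sketch that pulls the cached figure out of the free -m output posted in the question and expresses it as a share of total memory. The helper name summarize_free is hypothetical, and the parsing assumes the older free layout shown in the question, where "cached" is the seventh field of the Mem: line.

```shell
# Sample copied verbatim from the `free -m` output in the question.
free_sample='             total       used       free     shared    buffers     cached
Mem:         16041      14974       1067          0        565      12716
-/+ buffers/cache:       1692      14349
Swap:         4000          6       3994'

# summarize_free (hypothetical helper): reads old-style `free -m` text on
# stdin and prints total memory, cached memory, and cached as a whole-number
# percentage of total (truncated, not rounded).
summarize_free() {
  awk '/^Mem:/ {
    total = $2; cached = $7
    printf "total=%dMB cached=%dMB cached_pct=%d\n", total, cached, cached * 100 / total
  }'
}

printf '%s\n' "$free_sample" | summarize_free
# prints: total=16041MB cached=12716MB cached_pct=79
```

Newer versions of free use a different column layout, so the field number would need adjusting there.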

Although I would expect things to resolve on their own given that you have enough swap, it's possible that the kernel is having trouble deciding which processes to push into swap in order to run new processes that may require more than the remaining 1GB of free physical memory, which may result in the memory allocation error you report.

For sure, 13GB of kernel cache appears excessive. Rebooting the box to reset that and attempting splunk fsck again seems like the reasonable way to go.

hexx
Splunk Employee

Well, I think we can call this one solved. I'll post an answer to summarize my conclusions.


hodsonc
Explorer

Every fsck completed without errors (other than the warning given). Some took as little as 30 minutes, some took over 24 hours. This is with about 350GB of data per indexer.


hexx
Splunk Employee

Good point, I missed that in your free -m output. That splunkd memory usage is on the high side, but certainly not abnormal. To be clear, did the splunk fsck complete successfully this time?


hodsonc
Explorer

To correct my previous statement, it's using 12-14GB in cached files (depending on the indexer), not just for splunkd. splunkd is using about 2GB, whether doing the fsck or not.

I rebooted that server to completely clear the cache, and tried to test again, but it didn't do anything with the fsck.


hexx
Splunk Employee

Wow, that is not OK. Please try the following:

- Stop splunkd
- Move bucket /opt/splunk/var/lib/splunk/_internaldb/db/db_1327351718_1327338541_3791 to /var/tmp
- Run /opt/splunk/bin/splunk fsck --mode metadata --all --repair again while monitoring the memory usage of splunkd
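The steps above can be sketched as a small script. This is only an illustration: the run wrapper, the DRY_RUN variable, and the function name isolate_bucket_and_fsck are hypothetical, and the script defaults to printing the commands rather than executing them so nothing is moved by accident.

```shell
SPLUNK_HOME=/opt/splunk
BUCKET="$SPLUNK_HOME/var/lib/splunk/_internaldb/db/db_1327351718_1327338541_3791"

# run (hypothetical wrapper): with DRY_RUN=1, the default here, it only
# prints the command; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Stop Splunk, move the suspect bucket out of the index, then rerun fsck.
isolate_bucket_and_fsck() {
  run "$SPLUNK_HOME/bin/splunk" stop
  run mv "$BUCKET" /var/tmp/
  run "$SPLUNK_HOME/bin/splunk" fsck --mode metadata --all --repair
}

isolate_bucket_and_fsck
```

While the fsck runs, memory usage can be watched from a second terminal with a tool such as top.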

I'd be curious to know if the huge spike in memory usage for splunkd is due to fsck on this specific bucket.

Also, if you start Splunk without invoking fsck, is splunkd's memory usage normal?


hodsonc
Explorer

That's after starting up Splunk, so it's splunkd that's using 14G.


hexx
Splunk Employee

It looks like your system is running pretty low on memory there. Can you find out which processes are eating up all of that memory? If you'd like to specifically track Splunk memory usage, I would recommend using the SoS (Splunk on Splunk) app.
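As a quick way to see which processes hold the most memory, something along these lines may help. It is a rough sketch rather than a substitute for proper monitoring: the function name top_mem is hypothetical, and it ranks processes by resident set size (RSS, reported by ps in kilobytes).

```shell
# top_mem (hypothetical helper): print the N processes with the largest
# resident set size as "PID RSS COMMAND", largest first. N defaults to 5.
top_mem() {
  # The trailing "=" on each field suppresses the ps header line,
  # so every output line is a process and sorts numerically on RSS.
  ps -e -o pid= -o rss= -o comm= | sort -k2,2nr | head -n "${1:-5}"
}

top_mem 5
```

Note that this shows per-process memory only; memory used by the kernel for file caching does not appear under any process.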
