I just got this error while running fsck.
I upgraded to 4.3, and after the indexer upgrade it told me I should run an fsck on the indexes. After shutting down Splunk and starting the fsck, I got the following error:
$ /opt/splunk/bin/splunk fsck --mode metadata --all --repair
bucket=/opt/splunk/var/lib/splunk/_internaldb/db/db_1327351718_1327338541_3791 file='/opt/splunk/var/lib/splunk/_internaldb/db/db_1327351718_1327338541_3791/Sources.data' code=16 contains recover-padding, repairing...
src/system-alloc.cc:423] SbrkSysAllocator failed.
Other buckets then continue to be worked on.
All of these boxes are the same:
$ free -m
             total       used       free     shared    buffers     cached
Mem:         16041      14974       1067          0        565      12716
-/+ buffers/cache:       1692      14349
Swap:         4000          6       3994
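For anyone reading those numbers, here is a rough sketch of how the Mem row breaks down. It assumes the older `free` column layout shown above (total, used, free, shared, buffers, cached); newer versions of `free` print different columns. The function name `free_summary` is made up for illustration.

```shell
# Split the "Mem:" row of old-style `free -m` output into memory held by
# applications and memory the kernel can reclaim (free + buffers + cache).
# free_summary is a hypothetical helper name, not a standard tool.
free_summary() {
  awk '/^Mem:/ {
    used = $3; free = $4; buffers = $6; cached = $7
    printf "apps=%d MB reclaimable=%d MB\n", used - buffers - cached, free + buffers + cached
  }'
}

# Feed it the numbers posted above.
free_summary <<'EOF'
             total       used       free     shared    buffers     cached
Mem:         16041      14974       1067          0        565      12716
EOF
```

With the figures above this prints `apps=1693 MB reclaimable=14348 MB`, which matches the `-/+ buffers/cache` row to within rounding (`free` rounds from kilobytes, hence its 1692 and 14349).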
It looks like your system is running pretty low on memory there. Can you find out which processes are eating up all of that memory? If you'd like to track Splunk memory usage specifically, I would recommend using the SoS app.
Wow, that is not OK. Please try the following :
- Stop splunkd
- Move bucket
- Run /opt/splunk/bin/splunk fsck --mode metadata --all --repair again while monitoring the memory usage of splunkd
I'd be curious to know if the huge spike in memory usage for splunkd is due to fsck on this specific bucket.
Also, if you start Splunk without invoking fsck, is splunkd's memory usage normal?
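One possible way to do that monitoring, as a minimal sketch: it assumes a Linux box with /proc mounted, and that the process you care about really shows up under the name splunkd while fsck runs (worth confirming with ps first). The helper name `rss_mb` is made up for illustration.

```shell
# Print the combined resident memory (MB) of all processes with a given name,
# read from the Name: and VmRSS: fields of /proc/<pid>/status.
# rss_mb is a hypothetical helper name; the process name is passed as $1.
rss_mb() {
  name=$1
  total=0
  for d in /proc/[0-9]*; do
    # Processes can exit mid-scan; a failed read just contributes nothing.
    kb=$(awk -v n="$name" '/^Name:/ {ok = ($2 == n)} /^VmRSS:/ && ok {print $2}' "$d/status" 2>/dev/null)
    total=$((total + ${kb:-0}))
  done
  echo $((total / 1024))
}

# To sample every 5 seconds in a second terminal while fsck runs (Ctrl-C to stop):
# while sleep 5; do echo "$(date '+%H:%M:%S') splunkd RSS: $(rss_mb splunkd) MB"; done
```

If the number stays flat while fsck works through the bucket in question, the spike is probably not coming from splunkd itself.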
To correct my previous statement, it's using 12-14GB in cached files (depending on the indexer), not just for splunkd. splunkd is using about 2GB, whether doing the fsck or not.
I rebooted that server to completely clear the cache, and tried to test again, but it didn't do anything with the fsck.
Good point, I missed that in your free -m output. That splunkd memory usage is on the high side, but certainly not abnormal. To be clear, did the splunk fsck complete successfully this time?
Every fsck completed without errors (other than the warning given). Some took as little as 30 minutes, some took over 24 hours. This is with about 350GB of data per indexer.
Looking at your free -m output, we can see that very little physical memory remains free on your server. Oddly enough, most of it (almost 13GB of your 16GB total) appears to be held by the operating system for caching.
Although I would expect things to be resolved organically given that you have enough swap, it's possible that the kernel scheduler is having trouble deciding who to push into swap to execute new processes that may require more than the remaining 1GB of free physical memory, which may result in the memory allocation error you report.
For sure, 13GB of kernel cache appears excessive. Rebooting the box to reset that and attempting splunk fsck again seems like the reasonable way to go.