If you are running Splunk with systemd and have ever upgraded your host memory, it is possible that Splunk is not using it!
When you install Splunk and have completed your build process, if you are like me, one of the last things you do is:
./splunk enable boot-start -systemd-managed 1 -user splunk
reboot the box and pat yourself on the back, maybe grab a beer.
At some point later, you add more memory to the host (or in my case, resize the EC2 instance).
EC2 is a special case, which is what highlighted this for me - more on that later.
Assuming your host has swap space, everything will appear to work fine.
However, what you may not notice is that Splunk is heavily using swap instead of real memory.
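You can see how much swap the splunkd processes are actually using by reading VmSwap from /proc. A quick sketch using standard tools (pgrep and a readable /proc are assumed):
# Show how much swap each splunkd process is using (VmSwap, in kB)
for pid in $(pgrep splunkd); do
  printf '%s: ' "$pid"
  grep VmSwap "/proc/$pid/status"
done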
If you disable swap (swapoff -a), you will very likely see Splunk die rapidly with an OOM kill.
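If you want to confirm it really was the kernel's OOM killer that took splunkd down, check the kernel log. A minimal sketch, assuming a systemd host:
# Look for OOM killer activity in the kernel ring buffer
dmesg -T | grep -i 'out of memory'
# Or search the journal's kernel messages
journalctl -k | grep -i 'killed process'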
What happens is this:
When you run splunk enable boot-start, Splunk takes note of the total system memory at that moment and writes it into the Splunkd.service unit file as MemoryLimit.
Unless configured otherwise, the default Linux behaviour is to allow memory overcommit, so when Splunk asks for more memory than its cgroup limit allows, the kernel works around the shortfall and Splunk ends up being given swap instead of real memory.
If your swap is on fast disks you may not notice straight away, but Splunk will never be able to use more 'real' memory than was present when you initially ran enable boot-start.
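You can also ask systemd what limit it is actually enforcing right now, rather than reading the unit file. A sketch, assuming the service is named Splunkd (newer systemd versions report MemoryMax rather than MemoryLimit):
# Show the memory limit systemd is enforcing for the Splunk service (in bytes)
systemctl show Splunkd -p MemoryLimit
# On newer systemd versions the equivalent property is MemoryMax
systemctl show Splunkd -p MemoryMax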
Check if you are affected:
cat /etc/systemd/system/Splunkd.service
Look for the MemoryLimit line in the [Service] section:
[Service]
...
MemoryLimit=x
Then check your system memory:
cat /proc/meminfo
Look for the value of:
MemTotal: y kB
Note that the unit file value is in bytes while MemTotal is in kB, so convert before comparing.
If MemoryLimit works out lower than MemTotal, you have very likely been affected by this issue.
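If you prefer to let the shell do the comparison, here is a rough sketch. It assumes GNU grep (for -P) and that the unit file stores MemoryLimit as a plain byte value at /etc/systemd/system/Splunkd.service:
# MemoryLimit from the unit file, in bytes
limit_bytes=$(grep -oP '^MemoryLimit=\K[0-9]+' /etc/systemd/system/Splunkd.service)
# MemTotal from /proc/meminfo, converted from kB to bytes
mem_total_bytes=$(( $(grep -oP '^MemTotal:\s+\K[0-9]+' /proc/meminfo) * 1024 ))
echo "MemoryLimit: $limit_bytes bytes"
echo "MemTotal:    $mem_total_bytes bytes"
# If the unit file limit is lower than installed memory, you are affected
if [ "$limit_bytes" -lt "$mem_total_bytes" ]; then
  echo "MemoryLimit is below MemTotal - re-run enable boot-start"
fi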
To resolve:
1. Stop Splunk
2. Disable boot-start: ./splunk disable boot-start
3. Re-enable boot-start: ./splunk enable boot-start -systemd-managed 1 -user splunk
4. Restart Splunk
Now if you check the unit file, you should have the correct MemoryLimit applied and Splunk should be able to use all of it.
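To double-check that the fix took, a quick verification sketch (the daemon-reload should be harmless here, and on newer systemd versions the property may be called MemoryMax):
# Confirm the regenerated unit file carries the new limit
grep MemoryLimit /etc/systemd/system/Splunkd.service
# Make sure systemd has re-read the unit file
systemctl daemon-reload
# Confirm the limit systemd is actually enforcing
systemctl show Splunkd -p MemoryLimit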
Documentation:
UPDATE:
Following my submission, the documentation has been updated to reflect this. It now reads:
The MemoryLimit value is set to the total system memory available in bytes when the service unit file is created. The MemoryLimit value will not update if the total available system memory changes. To update the MemoryLimit value in the unit file, you can manually edit the value or use the boot-start command to disable and re-enable systemd.
http://docs.splunk.com/Documentation/Splunk/8.0.3/Admin/RunSplunkassystemdservice
The Splunk Docs team are still reviewing this, but this revised wording certainly makes it clearer.
Thanks Edward K!
EC2 corner case:
Generally speaking, "swap" on EC2 sucks!
Most hosts running in AWS have storage volumes which have a limitation on IOPS.
This can make the problem very apparent: if your swap lives on an EBS volume and is doing lots of IO, after a while the host burns through all of its IOPS burst credits, IO grinds to a halt, and iowait climbs.
(Tip: use htop and enable the display of iowait. You will see high load averages but comparatively low CPU use unless the iowait display is on.)
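If you prefer the command line to htop, iostat and vmstat make the same pattern visible - high iowait with comparatively idle CPUs. A sketch, assuming the sysstat package is installed:
# Per-device IO stats plus CPU %iowait, refreshed every 5 seconds
iostat -x 5
# Or watch swap-in/swap-out (si/so) next to the CPU 'wa' column
vmstat 5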
In my case this was exactly the problem I was experiencing.
The obvious solution is more RAM and less swap, so I upgraded the host and disabled the swap.
This instantly caused OOM kills, and so began much head scratching, digging, and ultimately the discovery of where the limit was coming from.
@nickhillscpl This is great. Would you mind re-writing it in question-and-answer format so it can be marked as Accepted? That will make it easier for future readers to know there's a solution here.
Hi Rich,
I had a think about how to phrase this as question/answer but couldn't really come up with a format that didn't sound patronising. 😞
Splunk used this post to re-write the documentation page, so I guess it has served its purpose, accordingly I have marked it as answered.