Hello,
For a few months now we have been facing an issue with stopping Splunk on Red Hat Enterprise Linux 8.
We do "systemctl stop Splunkd" to stop the Splunk process.
In most cases Splunk stops and the systemctl prompt comes back.
But sometimes (say 1 out of 10 times) Splunk stops, but the systemctl prompt does not come back.
Then, after 6 minutes (the timeout in the Splunkd.service file), systemctl comes back.
In /var/log/messages I see this after 6 minutes:
Splunkd.service: Failed with result 'timeout'.
Stopped Systemd service file for Splunk, generated by 'splunk enable boot-start'.
In splunkd.log I can see that Splunk has stopped. No Splunk process is running.
With "ps -ef | grep splunk" I can see that there are no Splunk processes running.
With "ps -ef | grep systemctl" I can see that systemctl is still running.
It happens on the search cluster, the index cluster, heavy forwarders, etc.
Splunk Support says it is a Red Hat Linux issue, and Red Hat points to Splunk.
I wonder if we are the only ones having this issue.
Any remarks are appreciated.
Regards,
Harry
I believe I have a fix, and I'm curious if it resolves your issue as well. I'm in close contact with Splunk Support about this, so I'm sure documentation will be coming out shortly.
Follow this documentation to enable cgroupsv2, reboot, and then disable/re-enable boot-start.
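In outline, it comes down to something like this (a sketch only - adjust the install path /opt/splunk and the splunk user to your environment, and check the docs for the exact procedure):

# switch the host from cgroup v1 (the RHEL 8 default) to cgroup v2, then reboot
sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
sudo reboot

# after the reboot, verify cgroup v2 is active (should print: cgroup2fs)
stat -fc %T /sys/fs/cgroup/

# regenerate the Splunkd.service unit file under the new cgroup layout
sudo /opt/splunk/bin/splunk disable boot-start
sudo /opt/splunk/bin/splunk enable boot-start -systemd-managed 1 -user splunk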
Hello everybody,
I want to confirm that the fix to enable cgroup v2 on RHEL 8 has solved the issue for us as well.
Regards,
Harry
If Splunk hangs and there are timeout issues, it could be a number of things. What I have seen in the wild is that this normally relates to performance or the underlying storage system, and the amount of ingest can cause these types of issues.
Timeouts could also relate to the network; what is the latency like between the Splunk instances?
How much data are you ingesting, and can the Splunk instances handle it?
https://docs.splunk.com/Documentation/Splunk/9.2.1/Capacity/Summaryofperformancerecommendations
1. Check that your CPU, memory, and disk I/O meet the requirements; if that is OK, then it is something else that needs investigation.
#Reference hardware
https://docs.splunk.com/Documentation/Splunk/9.2.1/Capacity/Referencehardware
2. Check that THP (transparent huge pages) has been disabled - plenty of topics on this on Google and in this community.
https://docs.splunk.com/Documentation/Splunk/9.2.1/ReleaseNotes/SplunkandTHP
3. Check that ulimits have been configured - again, plenty of topics on this on Google and in this community. (Quick checks for 2 and 3 are sketched below.)
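Quick ways to check 2 and 3 from the shell (the paths assume a default /opt/splunk install):

# THP should report [never] (or at least [madvise]), not [always]
cat /sys/kernel/mm/transparent_hugepage/enabled

# effective limits of the running splunkd process
cat /proc/$(pgrep -o splunkd)/limits | grep -E 'open files|processes'

# splunkd also logs the limits and THP state it detected at startup
grep -i 'ulimit\|hugepage' /opt/splunk/var/log/splunk/splunkd.log | tail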
Hi,
Thank you for the response
I am very sure that we fulfil these requirements.
No ingestion takes place, because there are no Splunk processes running.
So to be clear, it is not Splunk that hangs, but the systemctl command to stop Splunkd.service.
The Splunk processes have been stopped, but the systemctl command does not come back to the prompt.
I can see in splunkd.log that Splunk has stopped. "ps -ef | grep splunk": no Splunk processes.
Regards,
Harry
That's an odd one, never seen that. I have installed many Splunk instances on RHEL/CentOS/Fedora (7/8) over the years (not the other flavours so much), with systemctl and not init.d.
There may be a parameter that could be changed in the Splunkd.service file, for example TimeoutStopSec=360 - perhaps lower this. It's not something I've done or ever had to do, so only try it on a lab/test server first and see if that makes a difference.
Other areas to further troubleshoot/investigate
Ensure the splunk user has the below (add it to wheel or sudoers) and see if that makes a difference:
Non-root users must have super user permissions to manually configure systemd on Linux.
Non-root users must have super user permissions to run start, stop, and restart commands under systemd
Thanks for the tips.
As a workaround I made an override for the Splunk service, so the timeout occurs after 4 minutes instead of the default 6.
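For anyone wanting the same workaround, the override is roughly this (done here via systemctl edit; the value is in seconds):

sudo systemctl edit Splunkd.service     # opens an override file, add the two lines below
    [Service]
    TimeoutStopSec=240
sudo systemctl daemon-reload            # pick up the override
systemctl show Splunkd.service -p TimeoutStopUSec   # should now report 4min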
We run it as the root user, but the sudoers file is something for me to investigate.
Maybe it has something to do with rights, because other applications on Linux do not show this behaviour.
Hi
If you are running Splunk as the splunk user but use systemctl as root, there is no need to add splunk to the sudoers file!
I have seen the same kind of behavior on some AWS EC2 instances from time to time. However, I have never needed to look into why.
Hard to say which one is the root cause, Splunk or systemd; probably some weird combination causes this.
Do you ingest your OS logs, like messages, audit, etc., into Splunk? If yes, then you could try to find the reason from those. Another thing you could try is "dmesg -T" and journalctl, and see if those give you more hints.
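For example, right after a stop that has hung (using the Splunkd.service unit name from your post):

dmesg -T | tail -50                                        # kernel messages around the time of the hang
journalctl -u Splunkd.service --since "-30min"             # what journald recorded for the unit
journalctl -t systemd --since "-30min" | grep -i splunkd   # systemd's own messages about the stop job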
r. Ismo
I want to add that I am also having this problem, except the command exceeds the 360-second timeout by a minute or more.
Thank you,
Good to hear that we are not the only ones.
What Linux version are you running?
I'm running RHEL 8 on the latest version. We've been down the long road with Splunk Support and have confirmed exhaustively that systemd is hanging on processes that aren't there. And until systemd times out (360 seconds by default), it won't actually return to you. And when Splunk does return as "stopped", it didn't actually stop; the command just timed out (journalctl -f --unit <Splunk service file>). We're working with our Linux teams and likely Red Hat Support to figure out why.
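The easiest way to watch it happen is roughly this (a sketch, using the Splunkd.service unit name from earlier in the thread):

journalctl -f -u Splunkd.service       # terminal 1: follow the unit's log during the stop
sudo systemctl stop Splunkd.service    # terminal 2: issue the stop
systemctl list-jobs                    # terminal 3: the stop job stays listed until the timeout
systemd-cgls -u Splunkd.service        # shows whether systemd still sees any processes in the unit's cgroup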
If you have any news, please update this post.
We made a support call to Red Hat without any luck.
Hopefully it works out for you.
What OS version of Red Hat are you running?
I am running
Red Hat Enterprise Linux release 8.10 (Ootpa)
I have been experimenting and noticed a massive improvement on RHEL 9. Can you confirm that?
Unfortunately I cannot confirm, because all the nodes are on RHEL 8.
What version of Splunk are you running? Have you tested on lower versions?
We are on 9.2.2, but the issue started on 9.x.
I believe I have a fix, and I'm curious if it resolves your issue as well. I'm in close contact with Splunk Support about this, so I'm sure documentation will be coming out shortly.
Follow this documentation to enable cgroupsv2, reboot, and then disable/re-enable boot-start.
Thanks for the advice!
We're experiencing the same issue on the same RHEL (8.10).
Will also check on our test env if this helps.
Also interested in updates if someone finds out something 🙂
Regards,
Tobias
Hi AndrewBurnett,
Thank you for keeping me updated.
I have sent the link to our Linux colleagues and will hear what they think of it.
Harry