Splunk Enterprise

systemctl stop Splunkd hangs

harryvdtol
Explorer

Hello,

For a few months we have been facing an issue with stopping Splunk on Red Hat Enterprise Linux 8.

We do "systemctl stop Splunkd" to stop the Splunk proces.
In most cases Splunks stops and the systemctl prompts comes back.

But sometimes (let say 1 out of 10) Splunk stops, but the systemctl prompt does not comes back.

Then, after 6 minutes (the timeout in Splunkd.service), systemctl comes back.
In /var/log/messages I see this after 6 minutes:

Splunkd.service: Failed with result 'timeout'.
Stopped Systemd service file for Splunk, generated by 'splunk enable boot-start'.

In splunkd.log I can see that Splunk has stopped. No Splunk process is running.
With "ps -ef | grep splunk" I can see that there are no Splunk processes running.
With "ps -ef | grep systemctl" I can see that systemctl is still running.

It happens on the search cluster, the index cluster, heavy forwarders, etc.

Splunk Support says it is a Red Hat Linux issue, and Red Hat points to Splunk.

I wonder if we are the only ones having this issue.

Any remarks are appreciated.

Regards,

Harry


harryvdtol
Explorer

Hello everybody,

I want to confirm that the fix to enable cgroup v2 on RHEL 8 has solved the issue for us as well.

Regards,

Harry


deepakc
Builder

If Splunk hangs and there are timeout issues, it could be a number of things. What I have seen in the wild is that this normally relates to performance, the underlying storage system, or the amount of ingest; any of these can cause this type of issue.

Timeouts could also relate to the network: what is the latency like between the Splunk instances?

How much data are you ingesting, and can the Splunk instances handle it?
https://docs.splunk.com/Documentation/Splunk/9.2.1/Capacity/Summaryofperformancerecommendations

1. Check that your CPU, memory, and disk I/O meet the requirements; if they do, then it's something else that needs investigation. A sketch of some generic checks follows the link below.

#Reference hardware
https://docs.splunk.com/Documentation/Splunk/9.2.1/Capacity/Referencehardware
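
For example, some generic OS-level checks (not Splunk-specific; iostat comes from the sysstat package):

lscpu | grep -E '^CPU\(s\)|Model name'
free -h
iostat -x 5 3   # extended disk I/O statistics, 3 samples at 5-second intervals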


2. Check that THP (Transparent Huge Pages) has been disabled; there are plenty of topics on this on Google and in this community. A quick check is shown below the link.
https://docs.splunk.com/Documentation/Splunk/9.2.1/ReleaseNotes/SplunkandTHP
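
For example (the value in square brackets is the active setting; "[never]" means THP is disabled):

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag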

3. Check that ulimits have been configured; again, plenty of topics on this on Google and in this community. A quick check is sketched below.
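
For example (a sketch; $SPLUNK_HOME assumes the usual Splunk install path):

# limits of the running splunkd process (pgrep -o picks the oldest, i.e. the parent)
cat /proc/$(pgrep -o splunkd)/limits | grep -E 'open files|processes'
# splunkd also logs its effective ulimits at startup
grep -i ulimit $SPLUNK_HOME/var/log/splunk/splunkd.log | tail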


harryvdtol
Explorer

Hi,

Thank you for the response.
I am very sure that we fulfil these requirements.
No ingestion takes place, because there are no Splunk processes running.

So to be clear, it is not Splunk that hangs, but the systemctl command to stop Splunkd.service.
The Splunk processes have been stopped, but the systemctl command does not return to the prompt.
I can see in splunkd.log that Splunk has stopped. "ps -ef | grep splunk": no Splunk processes.


Regards,

Harry


deepakc
Builder

That's an odd one; I've never seen that. I have installed many Splunk instances on RHEL/CentOS/Fedora (7/8) over the years, with systemctl rather than init.d (not the other flavours so much).


There may be a parameter that could be changed in the Splunkd.service file, for example TimeoutStopSec=360; perhaps lower this. It's not something I've done or ever had to do, so try it only on a lab test server and see if it makes a difference (a sketch follows the link below).

https://docs.splunk.com/Documentation/Splunk/9.2.2/Workloads/Configuresystemd#Configure_systemd_manu... 
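
A minimal sketch of such an override via a systemd drop-in (the 240-second value is just an example):

systemctl edit Splunkd.service
# in the editor, add:
#   [Service]
#   TimeoutStopSec=240
# systemctl edit writes /etc/systemd/system/Splunkd.service.d/override.conf
# and reloads the daemon; verify the new value with:
systemctl show Splunkd.service -p TimeoutStopUSec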

 
Other areas to further troubleshoot/investigate:
Ensure the splunk user has the below (add it to wheel or sudoers) and see if that makes a difference:
Non-root users must have superuser permissions to manually configure systemd on Linux.
Non-root users must have superuser permissions to run start, stop, and restart commands under systemd.

https://docs.splunk.com/Documentation/Splunk/9.2.2/Workloads/Configuresystemd#Configure_systemd_manu... 


harryvdtol
Explorer

Thanks for the tips.

As a workaround I made an override for the Splunk service, so the timeout occurs after 4 minutes instead of the default 6.

We run it as the root user, but the sudoers file is something for me to investigate.
Maybe it has something to do with rights, because other applications on Linux do not show this behaviour.


isoutamo
SplunkTrust

Hi

If you are running Splunk as the splunk user but use systemctl as root, there is no need to add splunk to the sudoers file!

I have seen the same kind of behavior on some AWS EC2 instances from time to time. However, I have never needed to look into why.

Hard to say which one is the root cause, Splunk or systemd; probably some weird combination causes this.

Do you log your OS logs (messages, audit, etc.) into Splunk? If yes, you could try to find the reason in those. Another thing you could try is "dmesg -T" and journalctl, to see whether those give you more hints; a sketch follows below.
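
For example (generic commands; Splunkd.service is the unit name used earlier in this thread):

dmesg -T | tail -n 50
journalctl -u Splunkd.service --since "1 hour ago"
journalctl -b -p warning   # warnings and worse since boot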

r. Ismo


AndrewBurnett
Explorer

I want to add that I am also having this problem, except the command exceeds the 360-second timeout by a minute or more.


harryvdtol
Explorer

Thank you.

Good to hear that we are not the only ones.


What Linux version are you running?

 


AndrewBurnett
Explorer

I'm running RHEL 8 on the latest version. We've been down the long road with Splunk Support and have confirmed exhaustively that systemd is hanging on processes that aren't there, and until systemd times out (360 seconds by default) it won't actually return to you. And when the service is reported as "stopped", it didn't actually stop cleanly; the command just timed out (visible with journalctl -f --unit <Splunk service file>). We're working with our Linux teams and likely Red Hat Support to figure out why. A sketch of how to watch this is below.
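
For example (a sketch, using the Splunkd.service unit name from this thread):

# terminal 1: follow the unit's journal while stopping it
journalctl -f --unit Splunkd.service
# terminal 2: stop the service and time how long systemctl blocks
time systemctl stop Splunkd.service
# afterwards, check how the stop ended ("timeout" vs "success")
systemctl show Splunkd.service -p Result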


harryvdtol
Explorer

If you have any news, please update this post.

We made a support call to Red Hat, without any luck.
Hopefully it works out for you.


AndrewBurnett
Explorer

What OS version of Red Hat are you running?


harryvdtol
Explorer

I am running Red Hat Enterprise Linux release 8.10 (Ootpa).

 


AndrewBurnett
Explorer

I have been experimenting and noticed a massive improvement on RHEL 9. Can you confirm that?


harryvdtol
Explorer

Unfortunately I cannot confirm, because all our nodes are on RHEL 8.


AndrewBurnett
Explorer

What version of Splunk are you running? Have you tested on lower versions?


harryvdtol
Explorer

We are on 9.2.2, but the issue started on 9.x.


AndrewBurnett
Explorer

I believe I have a fix, and I'm curious whether it resolves your issue as well. I'm in close contact with Splunk Support about this, so I'm sure documentation will be coming out shortly.

 

Follow this documentation to enable cgroups v2, reboot, and then disable and re-enable boot-start; the gist is sketched after the link.

https://access.redhat.com/webassets/avalon/j/includes/session/scribe/?redirectTo=https%3A%2F%2Facces...
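
A sketch of the usual RHEL 8 procedure (assumes grub2/grubby and a systemd-managed Splunk; "-user splunk" is an example, adjust to your service account):

# switch the kernel to the unified cgroup v2 hierarchy, then reboot
grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
reboot
# verify after the reboot: "cgroup2fs" means cgroup v2 is active
stat -fc %T /sys/fs/cgroup/
# regenerate the Splunkd unit file under the new hierarchy
$SPLUNK_HOME/bin/splunk disable boot-start
$SPLUNK_HOME/bin/splunk enable boot-start -systemd-managed 1 -user splunk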


tobiasgoevert
Engager

Thanks for the advice!

We're experiencing the same issues on the same RHEL (8.10).

We will also check on our test env whether this helps.

Also interested in updates, if someone finds out something 🙂

Regards,

Tobias


harryvdtol
Explorer

Hi AndrewBurnett,


Thank you for keeping me updated.

I have sent the link to our Linux colleagues, and will hear what they think of it.

Harry

