Monitoring Splunk

Frequent service outages

sylim_splunk
Splunk Employee

We have identified frequent outages on the Enterprise Security server, which cause the Splunk service to go down.
No crash log is created when it goes down. Please advise where to look.

1 Solution

sylim_splunk
Splunk Employee

If you find no crash logs even though splunkd.log contains messages indicating that the process died, such as the ones below:

12-08-2020 14:08:20.238 -0400 ERROR ProcessRunner - helper process seems to have died (child killed by signal 9: Killed)!
12-08-2020 21:08:56.398 -0400 INFO ServerConfig - My GUID is F428C1-CB1F-4A95-85B5-6DD86B

In that case the process was most likely killed by a signal that the application cannot catch and handle, typically SIGKILL sent by the kernel.

i) If you use init.d and the OOM killer is enabled, check /var/log/messages and search for "OOM" or "out of memory" to see whether the kernel killed the splunkd process. (If you use systemd, the Splunk service restarts right after the crash, so there will not be much of a noticeable outage.)
Then check which processes were using more memory than the others at the time of the crash - use the Monitoring Console for this. A quick check on the OS side is sketched below.
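
For reference, a minimal check might look like the following (a sketch; the log path is /var/log/messages on RHEL/CentOS and /var/log/syslog on Debian/Ubuntu):

# Look for OOM-killer activity around the time of the outage
grep -iE "out of memory|oom-killer|killed process" /var/log/messages

# Confirm whether splunkd was the victim and note the timestamp
dmesg -T | grep -iE "oom|killed process" | grep -i splunk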

ii) If you use systemd with a Splunk version prior to 7.2.2 (or certain 7.2.x versions), the splunkd process can get killed well before it reaches the maximum memory you expect. Check the systemd unit file for the MemoryLimit parameter - it may accidentally be set to something like 100G even though the host has much more memory available. Configure it to a reasonable size, for example around 90% of physical memory; see the sketch below for one way to check and adjust it.
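
As an illustration, assuming the unit is named Splunkd.service (the default name Splunk uses), you could inspect and override the limit roughly like this:

# Show the memory limit currently applied to the unit
systemctl show Splunkd -p MemoryLimit

# Create/edit an override and set roughly 90% of physical memory,
# e.g. on a 256GB host:
#   [Service]
#   MemoryLimit=230G
systemctl edit Splunkd
systemctl daemon-reload
systemctl restart Splunkd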

iii) Check the ulimits, for example the open files limit. If splunkd accidentally came up with 4096, which is the system default, it can suffer from a lack of file descriptors (FDs):

04-17-2020 23:15:19.073 -0700 INFO ulimit - Limit: open files: 4096 files

Also see https://docs.splunk.com/Documentation/Splunk/8.0.3/Troubleshooting/ulimitErrors

A lack of FDs can cause various splunkd log messages; the "Too many open files" message below is one of them (a way to verify the limits of the running process is sketched after it):

04-05-2019 11:50:06.415 +1000 WARN SelfPipe - TcpChannelThread: about to throw a SelfPipeException: can't create selfpipe: **Too many open files**
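
One way to verify what the running splunkd process actually got (a sketch; the service account is assumed to be splunk) is:

# Effective limits of the main splunkd process
grep "open files" /proc/$(pgrep -o -u splunk splunkd)/limits

# For init.d startups, raise it via /etc/security/limits.conf, e.g.:
#   splunk soft nofile 64000
#   splunk hard nofile 64000
# For systemd, set LimitNOFILE=64000 in the Splunkd unit file instead,
# then restart splunkd so the new limit takes effect.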

iv) Also review the recommended hardware specifications in the documentation and add enough resources for the daily usage of the server:

https://docs.splunk.com/Documentation/Splunk/8.0.3/Capacity/Referencehardware

v) If you do find crash logs in $SPLUNK_HOME/var/log/splunk, open a Splunk Support ticket with a diag attached for further analysis; a sketch of generating the diag follows.
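
For reference, a diag can be generated as below (run as the user that owns the Splunk installation; $SPLUNK_HOME is assumed to be /opt/splunk):

# Creates a diag tarball (diag-<hostname>-<date>.tar.gz) to attach to the case
/opt/splunk/bin/splunk diag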

