Solved: Frequent service outages

sylim_splunk · ‎05-07-2020

We are seeing frequent outages on our Enterprise Security server, which take the Splunk services down.
No crash log is created when it goes down. Please advise where to look.

sylim_splunk · ‎05-07-2020

If you find no crash logs, but splunkd.log contains messages such as the following:

12-08-2020 14:08:20.238 -0400 ERROR ProcessRunner - helper process seems to have died (child killed by signal 9: Killed)!
12-08-2020 21:08:56.398 -0400 INFO ServerConfig - My GUID is F428C1-CB1F-4A95-85B5-6DD86B

In that case, splunkd was most likely killed by a signal that the application cannot handle properly, typically one sent by the kernel, such as SIGKILL.
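As a rough way to spot these events, here is a minimal Python sketch that scans splunkd.log lines for the "killed by signal N" fragment shown above and translates the number into a signal name. The function name find_kill_signals is made up for illustration, and the regular expression only assumes the message format from the sample line.

```python
import re
import signal

# Matches the "killed by signal N" fragment from the ProcessRunner message above.
SIGNAL_RE = re.compile(r"killed by signal (\d+)")


def find_kill_signals(log_lines):
    """Return (line, signal_name) pairs for lines reporting a child killed by a signal."""
    hits = []
    for line in log_lines:
        match = SIGNAL_RE.search(line)
        if not match:
            continue
        number = int(match.group(1))
        try:
            # signal.Signals maps a signal number to its symbolic name on this platform.
            name = signal.Signals(number).name
        except ValueError:
            name = f"signal {number}"
        hits.append((line.rstrip("\n"), name))
    return hits
```

You could feed it the lines of $SPLUNK_HOME/var/log/splunk/splunkd.log, for example with open(path) as f: find_kill_signals(f).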

i) If you use initd and the OOM killer is enabled, check /var/log/messages and search for "OOM" or "out of memory" to see whether it killed the splunkd process. If you use systemd, the Splunk service restarts right after the crash, so the outage may not be very noticeable.
Then use the Monitoring Console to check which processes were using more memory than the others at the time of the crash.
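The /var/log/messages check can be sketched in Python as below. This is only an illustration: the helper name find_oom_events and the exact patterns are assumptions, and kernel message wording differs between kernel versions, so treat the matches as candidates to review rather than proof.

```python
import re

# Phrases commonly found in kernel OOM killer messages (assumed; wording varies by kernel).
OOM_RE = re.compile(r"out of memory|oom-killer|oom_kill", re.IGNORECASE)


def find_oom_events(lines, process_name="splunkd"):
    """Return syslog lines that mention both an OOM event and the given process name."""
    return [
        line.rstrip("\n")
        for line in lines
        if OOM_RE.search(line) and process_name in line
    ]


def scan_file(path="/var/log/messages", process_name="splunkd"):
    """Convenience wrapper; reading this file usually requires root privileges."""
    with open(path, errors="replace") as handle:
        return find_oom_events(handle, process_name)
```

Compare the timestamps of any matches with the times of the outages.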

Make sure that THP (transparent huge pages) is disabled:
https://docs.splunk.com/Documentation/Splunk/8.0.3/ReleaseNotes/SplunkandTHP
- Using the Monitoring Console, check whether any heavy searches were using more memory than usual at the time of the crash. If so, you may want to enable the memory tracker for search processes to prevent the service outage: https://docs.splunk.com/Documentation/Splunk/8.0.3/Admin/limitsconf#Memory_tracker
- After the memory tracker is enabled, you can find out how many searches it has affected by searching for "Forcefully terminated search process": https://docs.splunk.com/Documentation/Splunk/8.0.3/Search/Limitsearchprocessmemoryusage
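For a quick check outside Splunk, the sketch below reads the THP setting the way Linux usually exposes it (the active mode is the one in square brackets) and counts the memory tracker message in a set of log lines. The sysfs path is the common location on mainstream kernels and may differ on your distribution; the function names are hypothetical.

```python
import re

# Common sysfs location of the THP setting; assumed, check your distribution.
THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(sysfs_text):
    """Return the bracketed mode, e.g. 'never' from 'always madvise [never]'."""
    match = re.search(r"\[(\w+)\]", sysfs_text)
    return match.group(1) if match else None


def thp_is_disabled(sysfs_text):
    """True only when the active THP mode is 'never'."""
    return active_thp_mode(sysfs_text) == "never"


def count_terminated_searches(log_lines):
    """Count log lines that report a search killed by the memory tracker."""
    phrase = "Forcefully terminated search process"
    return sum(1 for line in log_lines if phrase in line)
```

Usage would be something like thp_is_disabled(open(THP_ENABLED_PATH).read()) on the Splunk host.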

ii) If you use systemd with a Splunk version prior to 7.2.2, or with some 7.2.x versions, the splunkd process can get killed well before it reaches the maximum memory configured. Check the MemoryLimit parameter in the systemd unit file: it may have been accidentally set to, for example, 100G while the host has far more memory available. If so, set it to a reasonable size, perhaps 90% of the maximum memory.
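To compare the configured value with what the host actually has, something like the following Python sketch could help. It assumes the unit file uses a plain MemoryLimit=<number><suffix> line with systemd's base-1024 suffixes (K, M, G, T); other forms such as percentages are not handled here. The helper names are made up, and 90% is just the rough figure mentioned above.

```python
import re

# systemd size suffixes are base 1024.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_memory_limit(unit_text):
    """Return MemoryLimit in bytes, or None if it is unset or 'infinity'."""
    value = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if line.startswith("MemoryLimit="):
            value = line.split("=", 1)[1].strip()  # the last assignment wins
    if value is None or value == "" or value.lower() == "infinity":
        return None
    match = re.fullmatch(r"(\d+)([KMGT]?)", value)
    if not match:
        raise ValueError(f"unsupported MemoryLimit value: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def suggested_limit(total_memory_bytes, percent=90):
    """Return the given percentage of total memory, using integer arithmetic."""
    return total_memory_bytes * percent // 100
```

On Linux, total physical memory can be obtained with os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES"), and the effective unit file can be printed with systemctl cat followed by your Splunk unit name.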

iii) Check the ulimit values, such as open files. The limit may have accidentally come up as 4096, the system default, in which case splunkd suffers from a lack of file descriptors (FDs):

04-17-2020 23:15:19.073 -0700 INFO ulimit - Limit: open files: 4096 files

Also see https://docs.splunk.com/Documentation/Splunk/8.0.3/Troubleshooting/ulimitErrors

A lack of FDs can cause various splunkd log messages. One of them is "Too many open files":

04-05-2019 11:50:06.415 +1000 WARN SelfPipe - TcpChannelThread: about to throw a SelfPipeException: can't create selfpipe: **Too many open files**
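A small Python sketch for pulling the open files limit out of splunkd.log, based only on the "ulimit - Limit: open files" line format shown above. The function names are hypothetical, and the 64000 threshold is an assumption to verify against the ulimit documentation linked above for your version.

```python
import re

# Matches the startup line shown above, e.g. "... INFO ulimit - Limit: open files: 4096 files".
OPEN_FILES_RE = re.compile(r"ulimit - Limit: open files: (\d+)")


def open_files_limit(log_lines):
    """Return the most recently logged open files limit as an int, or None if not found."""
    latest = None
    for line in log_lines:
        match = OPEN_FILES_RE.search(line)
        if match:
            latest = int(match.group(1))  # later lines override earlier restarts
    return latest


def is_too_low(limit, minimum=64000):
    """True when the limit is below the chosen minimum (64000 is an assumed threshold)."""
    return limit < minimum
```

To see the limit of the current shell rather than what splunkd logged at startup, Python's resource.getrlimit(resource.RLIMIT_NOFILE) returns the soft and hard values.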

iv) Also check the recommended hardware specifications in the documentation and add enough resources for the daily usage of the server.

https://docs.splunk.com/Documentation/Splunk/8.0.3/Capacity/Referencehardware

v) If you do find crash logs in $SPLUNK_HOME/var/log/splunk, open a Splunk Support ticket and attach a diag for further analysis.
