Splunk (6.2.1) unexpectedly shut down and needed to be restarted. There were no issues with the server so the issue seems to be specific to Splunk. Is there a way to determine the root cause? Any specific log entries that should be checked?
Thanks for all the suggestions. It appears to be a memory usage issue related to the fact that Splunk 6.2.1 ignores the user's time zone setting for cron-scheduled searches and runs them against system time instead. We have many saved searches scheduled across multiple time zones, but now they all run at the same EST time, which consumes more resources.
I ran into the same issue on Red Hat:
Feb 20 15:46:09 DCTM1 kernel: Out of memory: Kill process 27016 (splunkd) score 193 or sacrifice child
Feb 20 15:46:09 DCTM1 kernel: Killed process 27016 (splunkd) total-vm:6018500kB, anon-rss:2398292kB, file-rss:1460kB, shmem-rss:0kB
Feb 20 15:46:09 DCTM1 kernel: splunkd: page allocation failure: order:0, mode:0x201da
Feb 20 15:46:09 DCTM1 kernel: CPU: 1 PID: 27016 Comm: splunkd Not tainted 3.10.0-514.el7.x86_64 #1
I had an issue where Red Hat was killing splunkd because of memory pressure; the OOM-killer messages were in /var/log/messages. As mentioned above, searching _internal (or even _*) should help determine the cause. The system activity dashboards are another good place to look.
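If you suspect the OOM killer, the kernel log can be checked directly. A minimal sketch (the /var/log/messages path assumes a Red Hat-style syslog layout; Debian-based distros use /var/log/syslog):

```shell
# Look for OOM-killer activity in the kernel log around the crash time.
grep -iE "out of memory|killed process" /var/log/messages

# If the log has rotated, recent kernel messages are also in the ring buffer:
dmesg | grep -iE "out of memory|killed process"
```

A hit on splunkd in either output confirms the kernel, not Splunk itself, terminated the process.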
If the Splunk instance is back up and running, run a search for "index=_internal" over the time range when it crashed and start looking for events.
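That search can also be run without the web UI via the splunk CLI. A sketch, assuming the splunk binary is on your PATH and a 24-hour window around the crash (narrow earliest/latest to the actual crash time):

```shell
# Query Splunk's own internal logs for errors from the command line.
# The time bounds and log_level filter here are illustrative, not required.
splunk search 'index=_internal earliest=-24h log_level=ERROR' -maxout 50
```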
Have you checked the crash logs in the Splunk log directory ($SPLUNK_HOME/var/log/splunk)?
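When splunkd dies abnormally it writes a crash-&lt;timestamp&gt;.log file there. A quick sketch for listing the most recent ones, assuming a default install under /opt/splunk:

```shell
# List the newest splunkd crash reports, if any exist.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"   # adjust if installed elsewhere
ls -lt "$SPLUNK_HOME"/var/log/splunk/crash-*.log 2>/dev/null | head -n 5
```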
Check your ulimit as well; on Linux the default open-file limit is 1024, and Splunk suggests at least 8192 for the user that splunkd runs as.
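You can verify the current limit and raise it persistently. A sketch, assuming splunkd runs as a user named splunk (the limits.conf lines use standard pam_limits syntax):

```shell
# Show the open-file limit for the current shell:
ulimit -n

# To raise it persistently for the splunk user, add to /etc/security/limits.conf:
#   splunk soft nofile 8192
#   splunk hard nofile 8192
# Then log the user out/in (or restart splunkd) so the new limit takes effect.
```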