Monitoring Splunk

(Troubleshooting) Indexer became unresponsive today; rebooting the server fixed it. A number of splunkd processes are dying and starting back up. Is this normal behavior?

dpanych
Communicator

One of the six indexers we have was unresponsive today. I couldn't log in through the web interface, and SSH'ing to the server was very slow. I figured it was an OS problem, rebooted the server, and things seem to be clear now. While looking at the logs, I noticed a number of splunkd processes dying. Is that normal? The server OS is RHEL 7.x.

Dec 28 10:56:21 PRDSRV01 systemd[1]: systemd-journald.service: got WATCHDOG=1
Dec 28 10:56:23 PRDSRV01 systemd[1]: Received SIGCHLD from PID 120270 (splunkd).
Dec 28 10:56:23 PRDSRV01 systemd[1]: Child 120270 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:25 PRDSRV01 systemd[1]: Received SIGCHLD from PID 120277 (splunkd).
Dec 28 10:56:25 PRDSRV01 systemd[1]: Child 120277 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:27 PRDSRV01 systemd[1]: Received SIGCHLD from PID 121218 (splunkd).
Dec 28 10:56:27 PRDSRV01 systemd[1]: Child 121218 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:29 PRDSRV01 systemd[1]: Received SIGCHLD from PID 121211 (splunkd).
Dec 28 10:56:29 PRDSRV01 systemd[1]: Child 121211 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:31 PRDSRV01 systemd[1]: Received SIGCHLD from PID 121697 (splunkd).
Dec 28 10:56:31 PRDSRV01 systemd[1]: Child 121697 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:32 PRDSRV01 systemd[1]: Received SIGCHLD from PID 121431 (splunkd).
Dec 28 10:56:32 PRDSRV01 systemd[1]: Child 121431 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:34 PRDSRV01 systemd[1]: Received SIGCHLD from PID 121228 (splunkd).
Dec 28 10:56:34 PRDSRV01 systemd[1]: Child 121228 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:36 PRDSRV01 systemd[1]: Received SIGCHLD from PID 122819 (splunkd).
Dec 28 10:56:36 PRDSRV01 systemd[1]: Child 122819 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:38 PRDSRV01 systemd[1]: Received SIGCHLD from PID 121324 (splunkd).
Dec 28 10:56:38 PRDSRV01 systemd[1]: Child 121324 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:40 PRDSRV01 systemd[1]: Received SIGCHLD from PID 120159 (splunkd).
Dec 28 10:56:40 PRDSRV01 systemd[1]: Child 120159 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:42 PRDSRV01 systemd[1]: Received SIGCHLD from PID 120296 (splunkd).
Dec 28 10:56:42 PRDSRV01 systemd[1]: Child 120296 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:44 PRDSRV01 systemd[1]: Received SIGCHLD from PID 123182 (splunkd).
Dec 28 10:56:44 PRDSRV01 systemd[1]: Child 123182 (splunkd) died (code=exited, status=255/n/a) 

Masa
Splunk Employee

It is difficult to say what caused the issue.

But according to your description:

One of the six indexers we have were unresponsive today. I couldn't login
through the web interface and ssh'ing to the server was very slow. I figured
it's an OS problem, rebooted the server, and things seem to be clear.

I believe the Splunk processes were also affected by the system resource/performance issue. Potentially the main splunkd process and its child splunkd processes could not communicate at all and died.
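
If you want to confirm that, it may help to check the OS logs for memory pressure or OOM killer activity around the time of the outage. For example (these assume standard RHEL 7 tooling; sar requires the sysstat package, and the time window should be adjusted to match your actual outage):

# Kernel messages about the OOM killer or other memory problems
dmesg -T | grep -iE 'out of memory|oom|killed process'

# System-level warnings from the journal around the outage window (substitute the real date/time)
journalctl --since "<outage start>" --until "<outage end>" -p warning

# Memory utilization history for the 28th, if sysstat is installed and collecting
sar -r -f /var/log/sa/sa28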


mattymo
Splunk Employee

What is the ulimit setting for the user running Splunk on this server?

Crashes caused by ulimit settings will usually leave a crash file behind, and I believe you said there are none, but it is worth a look.

ulimit -a

Also be sure to check splunkd.log for any errors or warnings.
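
For example, something along these lines could help (the user name splunk and the /opt/splunk install path are assumptions; adjust them to your environment):

# Limits for the user that runs splunkd (assumed here to be "splunk")
sudo -u splunk bash -c 'ulimit -a'

# Or read the limits of the running splunkd process directly
cat /proc/$(pgrep -o -x splunkd)/limits

# Recent errors and warnings from splunkd.log (path assumes a default /opt/splunk install)
grep -E 'ERROR|WARN' /opt/splunk/var/log/splunk/splunkd.log | tail -n 50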


alemarzu
Motivator

Hi @dpanych,

Any crash logs in $SPLUNK_HOME/var/log/splunk ?

EDIT: path
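
For example (assuming $SPLUNK_HOME is set in your shell; otherwise substitute your install path):

# List any splunkd crash files, newest first; no matches means no crash files
ls -lt "$SPLUNK_HOME"/var/log/splunk/crash-*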


dpanych
Communicator

I do not see anything at that location, but I found some crash logs dated from the beginning of 2016 (crash-2016-xxxx) in $SPLUNK_HOME/var/log/splunk; I think those are irrelevant. Is this behavior normal while the indexer is processing and indexing data?


alemarzu
Motivator

Oh, my bad, /var/log/splunk it is. That's not normal, for sure.
