I had installed a Splunk forwarder alongside an application server.
The Splunk forwarder stopped all of a sudden. We managed to bring it back up, but I found the log entries below in splunkd.log from just before it stopped:
03-21-2017 07:50:03.565 -0400 WARN FileClassifierManager - Unable to open '/abc/logs/SystemLogs/ApplicationName_EnvironmentName_ServerName/SystemOut.log.1490097001'.
03-21-2017 07:50:07.486 -0400 WARN FileClassifierManager - The file '/abc/logs/SystemLogs/ApplicationName_EnvironmentName_ServerName/SystemOut.log.1490097001' is invalid. Reason: cannot_read
03-21-2017 07:50:07.486 -0400 INFO TailingProcessor - Ignoring file '/abc/logs/SystemLogs/ApplicationName_EnvironmentName_ServerName/SystemOut.log.1490097001' due to: cannot_read
03-21-2017 07:50:07.486 -0400 ERROR WatchedFile - About to assert due to: destroying state while still cached: state=0x0x7efdef2c72c0 wtf=0x0x7efdf0ecb800 off=0 initcrc=0x1e849f714355fa00 scrc=0x0 fallbackcrc=0x0 last_eof_time=1490097002 reschedule_target=0 is_cached=343536 fd_valid=true exists=true last_char_newline=true on_block_boundary=true only_notified_once=false was_replaced=true eof_seconds=3 unowned=false always_read=false was_too_new=false is_batch=true name="/abc/logs/SystemLogs/ApplicationName_EnvironmentName_ServerName/SystemOut.log.1490097001"
Can you please help me troubleshoot why the Splunk forwarder stopped all of a sudden?
NOTE: The indexer this forwarder was connected to continued to work. After this point, there were no logs from the Splunk forwarder - not even internal logs.
If the forwarder stops and won't restart, then of course it can't generate any more internal Splunk logs. I'm not sure which internal logs you mean, though.
It looks like the forwarder was unable to open a file. I doubt that alone is why the forwarder crashed, but it is possible. More likely, these messages are related to the real cause. My guess is that some file ownership and/or permissions changed on the box, and this broke the forwarder. Things to check on the forwarder:
Run btool. A simple "./splunk btool check" will reveal any typos in the configuration files.
The forwarder runs under some user account credentials. What is this user account? Can this account read '/abc/logs/SystemLogs/ApplicationName_EnvironmentName_ServerName/SystemOut.log.1490097001'?
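One way to test that point is to identify the account splunkd runs under and then try the shell's -r (readable) test as that account. The sketch below uses a temporary file as a stand-in so it runs anywhere; on the real host, substitute the SystemOut path from the logs and run the check as the splunk account (e.g. via sudo -u).

```shell
# Find the account splunkd runs under on the forwarder host:
#   ps -ef | grep '[s]plunkd'

# Stand-in for the monitored file; substitute the real SystemOut path.
LOGFILE=$(mktemp)
chmod 600 "$LOGFILE"    # readable only by its owner

if [ -r "$LOGFILE" ]; then
    echo "readable by $(id -un)"
else
    echo "NOT readable by $(id -un)"
fi
```

If the real file comes back "NOT readable" when checked as the splunk account, that matches the cannot_read messages in your splunkd.log.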
All of the Splunk forwarder files (usually the entire /opt/splunkforwarder directory) must be owned by the user account that runs Splunk. If someone used the root account to stop/start the forwarder, some file ownerships may have changed. Reset all file ownership back to the user account, then try to restart the forwarder. If you have the Splunk forwarder set to restart on reboot (aka boot-start), make sure it will restart under the user account and not root.
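To audit that, find(1) can list anything under the install directory that is not owned by the expected account. The sketch below uses a temporary directory and the current user as stand-ins so it is safe to run as-is; on the forwarder, point SPLUNK_HOME at /opt/splunkforwarder and SPLUNK_USER at the real splunk account.

```shell
# Stand-ins for demonstration; on the forwarder use
# SPLUNK_HOME=/opt/splunkforwarder and the real splunk account.
SPLUNK_HOME=$(mktemp -d)
SPLUNK_USER=$(id -un)
touch "$SPLUNK_HOME/inputs.conf"

# List anything under SPLUNK_HOME not owned by the expected user.
# Empty output means ownership is consistent.
find "$SPLUNK_HOME" ! -user "$SPLUNK_USER"

# If anything shows up, reset ownership and restart as that user:
#   chown -R "$SPLUNK_USER" "$SPLUNK_HOME"
#   sudo -u "$SPLUNK_USER" "$SPLUNK_HOME/bin/splunk" restart
```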
Watch carefully when you restart the forwarder: does it give any error messages on startup?
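A quick way to do that watching is to grep splunkd.log for errors right after the restart. The snippet below demonstrates on a stand-in file built from sample lines; on the forwarder, the log normally lives at /opt/splunkforwarder/var/log/splunk/splunkd.log.

```shell
# Stand-in splunkd.log for demonstration; on the forwarder use
# /opt/splunkforwarder/var/log/splunk/splunkd.log (default location).
SPLUNKD_LOG=$(mktemp)
printf '%s\n' \
  '03-21-2017 08:00:01.000 -0400 INFO  loader - Splunkd starting' \
  '03-21-2017 08:00:02.000 -0400 ERROR WatchedFile - example startup error' \
  > "$SPLUNKD_LOG"

# Pull any errors logged around startup out of splunkd.log.
grep -E 'ERROR|FATAL' "$SPLUNKD_LOG"
```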
Thank you very much.
There was no change in ownership. In fact, when I simply started the forwarder again, it came up without any errors, and the logs started flowing almost immediately.
I'm trying to find the root cause so that it does not happen again. Could the forwarder have run out of memory? How do I check the logs for out-of-memory errors?
Also, by internal logs I meant the splunkd and metrics logs. I am new to Splunk and assumed the internal logs are always being written.