Solved: Splunk not starting when $SPLUNK_DB changed

diligentpenguin · ‎06-26-2020

I cannot get the Splunk service to start after changing the $SPLUNK_DB variable in /opt/splunk/etc/splunk-launch.conf.

What I’ve tried and further background information:

I have verified that the following steps work if the $SPLUNK_DB variable is NOT set, in other words, when it defaults to $SPLUNK_HOME/var/lib/splunk:

systemctl stop Splunkd.service

systemctl start Splunkd.service

But once I edit the $SPLUNK_DB variable, I cannot get Splunk to start. Likewise, Splunk will not start after a reboot if $SPLUNK_DB is set, but it will start after a reboot if the variable is not set.

The $SPLUNK_DB variable is set to /mnt/splunk, a CIFS share that I have verified is mounted and can be accessed by the system. (For the curious, this is a testing environment for me to learn Splunk. Splunk is installed on a small NUC with a decent processor and RAM but there’s a single consumer SSD drive with limited space. The CIFS share is on a NAS with multiple terabytes of extra space. I know performance won’t be great, but then, neither will the flow of data.)

Next I switched to the splunk user (because that seems to be the user that owns the files in the /opt/splunk directory) to see if the issue was a permissions problem. I used sudo su - splunk. I verified that I can create, write, read, and delete files in /mnt/splunk as the splunk user, the root user, and my personal user on Linux. Conclusion: it doesn’t seem to be a permissions problem.

Curiously, when I changed the conf file while Splunk was running, Splunk created a series of directories and subdirectories inside /mnt/splunk. I can see top-level directories named audit, authDb, and hashDb. (There’s no data in them, as I don’t have Splunk set up to receive any data yet.)

I tried the following search of all the log files hoping I would find clues about why this database path was causing me trouble.

/opt/splunk/var/log/splunk# cat *.log | grep 'mnt/splunk'

It found nothing. (But if I search instead for the default db path, 'var/lib/splunk', I find dozens or hundreds of entries, so the search works.)

I’m at a loss. Are there other steps I should take beyond changing the path to $SPLUNK_DB? Is there anything I can do to understand why Splunk isn’t starting?

diligentpenguin · ‎06-26-2020

The plot thickens… In looking at my splunkd.log file, it dawned on me that nothing was getting generated because Splunk never even started when I tried to start it. So I looked around and learned that one can look at systemd log files. (Let the record show that I figured this out shortly before your suggestion to do basically the same thing. I’m pretty green when it comes to Linux…) The following command shows Splunk systemd entries. (The -b option means since last boot):

journalctl -u Splunkd.service -b

It reveals a lot of entries, but here are the key entries that appear to explain my problem:

Jun 26 14:31:51 splunk systemd[1]: Started Systemd service file for Splunk, generated by 'splunk enable boot-start'.

Jun 26 14:31:51 splunk splunk[42566]: Checking http port [8000]: open

Jun 26 14:31:51 splunk splunk[42566]: Checking mgmt port [8089]: open

Jun 26 14:31:52 splunk splunk[42566]: Checking appserver port [127.0.0.1:8065]: open

Jun 26 14:31:52 splunk splunk[42566]: Checking kvstore port [8191]: open

Jun 26 14:31:52 splunk splunk[42609]: Checking configuration... Done.

Jun 26 14:31:52 splunk splunk[42617]: homePath='/mnt/splunk/audit/db' of index=_audit on unusable filesystem.

Jun 26 14:31:52 splunk splunk[42617]: Checking critical directories... Done

Jun 26 14:31:52 splunk splunk[42617]: Checking indexes...

Jun 26 14:31:52 splunk splunk[42566]: Validating databases (splunkd validatedb) failed with code '1'. If you cannot resolve the issue(s) above after consulting documentation, please file a case online at http://www.splunk.com/page/submit_issue

Jun 26 14:31:52 splunk systemd[1]: Splunkd.service: Main process exited, code=exited, status=10/n/a

Jun 26 14:31:52 splunk systemd[1]: Splunkd.service: Failed with result 'exit-code'.
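For anyone else wading through a long journal, here is a minimal sketch of a filter that keeps only the lines that explain a failed start. The function name splunk_start_errors is made up for illustration, and the patterns are simply the wording from the entries quoted above, so other failures may need different patterns.

```shell
#!/bin/sh
# Hypothetical helper: read journal lines on stdin and keep only the ones
# that explain a failed start. The patterns come from the entries quoted in
# this post; other failure modes may be worded differently.
splunk_start_errors() {
  grep -E 'unusable filesystem|failed with code|Failed with result'
}

# Example (needs systemd and the Splunkd.service unit):
#   journalctl -u Splunkd.service -b --no-pager | splunk_start_errors
```

Piping the journal through it narrows the output to the "unusable filesystem" line, the validatedb failure, and systemd's final result.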

Further digging indicates that CIFS is not supported. And neither really is NFS.

Here’s a discussion on the issue.

And official documentation at the following URL:

https://docs.splunk.com/Documentation/Splunk/8.0.4/Installation/Systemrequirements
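Since the cause turned out to be the filesystem type rather than permissions, a quick check of what actually backs the path can save some time. This is a rough sketch: the helper names and the list of types are my own assumptions for illustration, not Splunk's own check, and findmnt comes from util-linux.

```shell
#!/bin/sh
# Return success (0) if the given filesystem type is one of the network
# filesystems discussed in this thread. The list is an assumption for
# illustration, not the check Splunk itself performs.
is_network_fs() {
  case "$1" in
    cifs|smb3|nfs|nfs4) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the filesystem type that backs a path (findmnt is part of util-linux).
fs_type_of() {
  findmnt -n -o FSTYPE --target "$1"
}

# Example:
#   t=$(fs_type_of /mnt/splunk)
#   is_network_fs "$t" && echo "network filesystem: $t"
```

Running the example against the path you plan to use for $SPLUNK_DB before editing splunk-launch.conf shows whether it sits on a share like mine.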

I also found this gem, which sure enough explains that Splunk is designed to not start when it detects unsupported file systems and that you can bypass this check at your own risk with the hilariously named OPTIMISTIC_ABOUT_FILE_LOCKING=1 variable in the splunk-launch.conf file.
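For reference, this is roughly what the relevant lines in splunk-launch.conf would look like with the bypass enabled. It is a sketch only, using the path from my setup, and the at-your-own-risk warning applies.

```ini
# /opt/splunk/etc/splunk-launch.conf
# Index storage on the CIFS share (path from this thread).
SPLUNK_DB=/mnt/splunk

# Bypass the filesystem check. Unsupported, so corruption is possible.
OPTIMISTIC_ABOUT_FILE_LOCKING=1
```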

So now I need to figure out how to move forward. It would seem my options are:

1) Run with CIFS and hope for the best? The data isn't critical, but my learning is going to be royally interrupted when the whole db is corrupted and I have to trash it and start over.

2) Use NFS (which I can, in theory, enable on my NAS.)

3) Attach an external disk via USB to my little NUC and use that for storage?

4) Use the limited space on my SSD for hot and warm buckets, see how fast they fill up, and then set up cold and frozen buckets via NFS.

I lean toward option 4.
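A minimal sketch of what option 4 might look like in indexes.conf for the main index. The /mnt/splunk_cold and /mnt/splunk_frozen paths are hypothetical NFS mounts that I made up for illustration, and I have not confirmed whether Splunk's filesystem check also objects to a coldPath on NFS.

```ini
# $SPLUNK_HOME/etc/system/local/indexes.conf
[main]
# Hot and warm buckets stay on the local SSD (default location).
homePath   = $SPLUNK_DB/defaultdb/db
# Cold buckets go to the NAS (hypothetical mount point).
coldPath   = /mnt/splunk_cold/defaultdb/colddb
thawedPath = $SPLUNK_DB/defaultdb/thaweddb
# Archive frozen buckets instead of deleting them (hypothetical path).
coldToFrozenDir = /mnt/splunk_frozen/defaultdb
```

With this layout $SPLUNK_DB stays at its default, so the setting that was keeping Splunk from starting is no longer in play for the hot and warm buckets.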

Anyway, thanks for the help. I just wanted to write down what I’d found in case it helps someone else.


richgalloway · ‎06-26-2020

What did you find in splunkd.log?

---
If this reply helps you, Karma would be appreciated.

diligentpenguin · ‎06-26-2020

Because the log file was rather large, here's the output from the current hour only. Right before grabbing the output, I changed the $SPLUNK_DB variable and tried (unsuccessfully again) to start splunk using the same command I referenced above.

I was going to attach the file but the forums only allow jpg, gif, and png. Uggh. So then I tried pasting but was denied because it exceeded 20,000 characters. It seems it's just not my day.

So here's the partial log, as a Dropbox link.

Thanks!

richgalloway · ‎06-26-2020

So it looks like there are several restarts of Splunk in that log. Just one would have made for smaller pasting.

There are very few errors in the log and none of them seem related to Splunk not staying up. However, these lines start a very relevant section:

06-26-2020 16:18:19.951 +0000 INFO IndexProcessor - handleSignal : Disabling streaming searches.
06-26-2020 16:18:19.951 +0000 INFO IndexProcessor - request state change from=RUN to=SHUTDOWN_SIGNALED
06-26-2020 16:18:19.951 +0000 INFO UiHttpListener - Shutting down webui
06-26-2020 16:18:19.951 +0000 INFO UiHttpListener - Shutting down webui completed
06-26-2020 16:18:24.958 +0000 INFO loader - Shutdown HTTPDispatchThread
06-26-2020 16:18:24.959 +0000 INFO ShutdownHandler - Shutting down splunkd

It seems that Splunk is receiving a signal to shut down and so it does so. I suspect this signal is coming from systemd. Check the systemd journal (run journalctl).

I've seen a case where systemd would shut down Splunk because an external drive was not ready yet. IIRC, the fix involved using the "After" option to make sure the drive was mounted before starting Splunk, as well as an ExecStartPre option to force a mount of the external drive, just to make sure it's there. I'll see if I can find the details.
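In the meantime, here is a rough sketch of what such a drop-in could look like. It is written from memory rather than taken from that case: the file name is made up, the mount unit name assumes the share is mounted at /mnt/splunk, RequiresMountsFor is my addition on top of "After", and this ExecStartPre only checks that the mount is present instead of forcing one. The path to mountpoint may differ on your distribution.

```ini
# /etc/systemd/system/Splunkd.service.d/wait-for-mount.conf (hypothetical file name)
[Unit]
# Start Splunk only after the mount unit for /mnt/splunk.
After=mnt-splunk.mount
# Also pull the mount in as a requirement.
RequiresMountsFor=/mnt/splunk

[Service]
# Fail early if the path is not an active mount point.
ExecStartPre=/usr/bin/mountpoint -q /mnt/splunk
```

Run systemctl daemon-reload after creating the file so systemd picks up the change.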


diligentpenguin · ‎06-26-2020