I would like to start a discussion on how the community monitors its Splunk deployments. What are some of the methods you use?
How would you manage hundreds, if not thousands, of Splunk instances across multiple data centers, all of which can be clustered in groups/deployments?
I wrote the app Alerts For Splunk Admins for this purpose. Some of the alerts are built upon the monitoring console and some are much more detailed; they cover the failure scenarios I've found in the past and any contributed by others.
I use a product/service called Omnicenter, from Netreo. It monitors the health of ports 8000 and 8089 and sends an email alert to my group if anything is amiss there. We use this for monitoring all of our critical systems. I also wrote a small script that tests whether my RAID array is writable and puts a zero or a one in a node in the SNMP tree, which Omnicenter polls regularly (we had a situation where the RAID got into some weird state where it was readable but not writable).
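For anyone who wants to try the same kind of writability test, here is a minimal sketch in Python. It is not the script described above: the mount point `/mnt/raid`, the function name `is_writable`, and the convention that 1 means writable are all assumptions for illustration, and getting the value into the SNMP tree is left out.

```python
import os
import tempfile


def is_writable(directory: str) -> int:
    """Return 1 if a small probe file can be written and synced in directory, else 0."""
    try:
        # The probe file is deleted automatically when the with-block exits.
        with tempfile.NamedTemporaryFile(dir=directory, prefix=".write_probe_") as probe:
            probe.write(b"probe")
            probe.flush()
            # Force the write to the device so a read-only or hung array is noticed.
            os.fsync(probe.fileno())
        return 1
    except OSError:
        # Covers read-only filesystems, missing mounts, permission errors, and I/O errors.
        return 0


if __name__ == "__main__":
    # Hypothetical mount point; replace with the path of the array under test.
    print(is_writable("/mnt/raid"))
```

The script prints a single 0 or 1, so whatever publishes the value for polling only has to read one line of output.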
Like any other IT system/server, Splunk needs basic monitoring from the outside.
A good start is to monitor the main Splunk processes: splunkd and the Splunk helper processes. This can be done with a basic script that calls the $SPLUNK_HOME/bin/splunk status command.
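A basic wrapper around the status command could look something like the sketch below. It assumes that a zero exit code means splunkd is running and anything else means it is not, and it falls back to `/opt/splunk` when `SPLUNK_HOME` is not set; both are assumptions to verify on your own hosts. The function and parameter names are made up for this example.

```python
import os
import subprocess


def default_status_command():
    """Build the status command from SPLUNK_HOME (the /opt/splunk fallback is an assumption)."""
    splunk_home = os.environ.get("SPLUNK_HOME", "/opt/splunk")
    return [os.path.join(splunk_home, "bin", "splunk"), "status"]


def splunkd_is_running(command=None, timeout=30.0) -> bool:
    """Run the status command and treat exit code 0 as 'running' (assumed convention)."""
    if command is None:
        command = default_status_command()
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or the command hung: report as not running.
        return False
    return result.returncode == 0


# Example use from cron or an automation tool:
#     if not splunkd_is_running():
#         ...send an alert...
```

Passing the command in as a parameter keeps the check easy to test without a Splunk install.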
You can also check whether the ports are up; a simple telnet to the Splunk ports will do the trick.
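The telnet test can be scripted as a plain TCP connect. The sketch below is one way to do it with the Python standard library; the function name, the hostname in the usage comment, and the timeout are illustrative assumptions, while 8000 and 8089 are the ports mentioned earlier in this thread.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # Open and immediately close the connection; no data is sent.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, host unreachable, or name resolution failed.
        return False


# Example use (hostname is hypothetical):
#     for port in (8000, 8089):
#         print(port, port_is_open("splunk01.example.com", port))
```

Keep in mind that an open port only shows that something is listening; it does not prove that Splunk is healthy.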
But also keep in mind that much more could be involved, such as SAN, NFS, the network, and so on.
Hope this helps ...
Does it make sense to install cron jobs on 3000+ machines to watch Splunk? Automation tools like Ansible would be a great way to hit remote hosts at massive scale. What I hope others can share is the method they use to check the Splunk process. There are so many tools to try out and test; I'm looking for the perfect solution, hehe.
SoS App works well.
You can write a search that checks splunkd.log for stop and start events. See below:
index=_internal source=*splunkd.log host=* component=IndexProcessor ("shutting down: end" OR "Initializing: readonly") | eval restart_status=if(message="shutting down: end","Stopping","Starting")
Currently I have a script that just hits splunkd via the REST API and checks whether it gets a response. If not, then the process or something else is down. Would there be a better way to run this check across thousands of hosts?
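One possible way to scale that kind of check is to run it from a central host with a thread pool, so nothing has to be installed on the monitored machines. The sketch below is only an illustration and not the script described above. It assumes any HTTP response, even a 401, means splunkd is answering, and that certificate verification is skipped by default because the hosts may present self-signed certificates. The function names, worker count, and the URLs in the usage comment are hypothetical.

```python
import http.client
import ssl
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def splunkd_responds(url: str, timeout: float = 5.0, verify_tls: bool = False) -> bool:
    """Return True if the URL yields any HTTP response, whatever the status code."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Assumption: hosts may use self-signed certificates, so skip verification.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context):
            return True
    except urllib.error.HTTPError:
        # An error status such as 401 still proves that something answered.
        return True
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Refused, timed out, unresolvable, TLS failure, or a non-HTTP reply.
        return False


def check_many(urls, max_workers: int = 50) -> dict:
    """Check all URLs concurrently and return a {url: bool} mapping."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(splunkd_responds, urls))
    return dict(zip(urls, results))


# Example use (hostnames are hypothetical; 8089 is the port mentioned in this thread):
#     down = [u for u, ok in check_many(
#         f"https://{h}:8089/" for h in ("idx01.example.com", "sh01.example.com")
#     ).items() if not ok]
```

Each check is a single request with a short timeout, so the footprint on the monitored host stays small.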
What if the instance or host that is running the DMC goes down? If I am correct, a DMC can only monitor one SHC, so we would need multiple DMC setups for multiple SHCs.
Ideally I want a tool that can monitor 10+ different SHCs, with nodes numbering in the hundreds. And that's not even counting all the forwarders 😞
I really just want the basic check: is splunkd running? If not, please let me know NOW. I'm looking for the check that leaves the smallest footprint.