Does anyone with monitoring experience have suggestions for alerting on watchdog "no response" errors?
We are currently considering taking the average daily count for the past 30 days, multiplying it by 2, and creating an alert if the current day exceeds that amount. We have the response timeout set at the default of 8 seconds. Our Splunk Cloud indexers (24 of them) are receiving ~14k of these errors a day (which seems "holy cow" high). We found our max thread delay to be around 54 seconds (when running normally).
With only limited documentation available for server.conf and authorize.conf on the Splunk documentation website, I opened a Splunk support ticket. Response below.
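The alert rule described above (2x the trailing 30-day average) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not something from the support response; the function name `should_alert`, its parameters, and the sample counts are hypothetical.

```python
from statistics import mean

def should_alert(daily_counts, today_count, multiplier=2.0):
    """Return True when today's error count exceeds multiplier x the trailing average.

    daily_counts: watchdog "no response" error counts per day for the
                  baseline window (e.g. the past 30 days). Hypothetical input.
    today_count:  the count for the current day.
    """
    if not daily_counts:
        return False  # no baseline yet, so do not alert
    baseline = mean(daily_counts)
    return today_count > multiplier * baseline

# Illustrative numbers only: a flat baseline of ~14k errors a day.
history = [14000] * 30
print(should_alert(history, 20000))  # below 2 x 14000
print(should_alert(history, 30000))  # above 2 x 14000
```

With a baseline as high as ~14k a day, the 2x threshold sits at ~28k, so a rule like this only fires on a large spike; the multiplier is the knob to tune.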
1) What does watchdog monitor?
Watchdog is a feature, new as of the 7.2.x branch, that is exposed in Splunk logging.
Watchdog is a way for Splunk to monitor the internal threads it creates and to receive information whenever a monitored thread exceeds the configured response timeout (which can be caused by either one long-running task or a large number of small tasks).
The functionality was introduced from a need to:
- get more information about potential bottlenecks and deadlocks when gathering data
- get a snapshot of all nodes (all processes, including searches) gathered at the time of the incident
- be useful during development and QA testing to find performance issues before shipping Splunk
Features include:
From the configuration files you can configure:
- Response timeout (max time to respond to the watchdog)
- Actions to invoke when a blocked/slow thread is observed
- Call stack creation (max number, interval)
- UI messaging
- Logging control
- Call stack (pstacks) generation, i.e. create a stack of the blocked thread or of all registered threads when a blockage is detected
- Stacks on endpoints, to quickly generate call stacks of running threads
Watchdog messages are logged to $SPLUNK_HOME/var/log/watchdog/watchdog.log
Watchdog alerts trigger when a busy thread takes more than 8 seconds to respond. They do not trigger for a busy thread that responds in under 8 seconds, nor for a thread that executes successfully.
Latency just implies Splunk took a little longer servicing.
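As a minimal sketch of the trigger rule just described, the three cases (busy and slow, busy but fast, completed) reduce to one comparison. The function name and parameters are hypothetical and only model the rule as stated in the support response; this is not Splunk's implementation.

```python
def watchdog_would_report(is_busy, response_seconds, timeout_seconds=8.0):
    """Model of the stated rule: report only a busy thread that exceeds the timeout.

    timeout_seconds defaults to 8, the default response timeout mentioned above.
    """
    return is_busy and response_seconds > timeout_seconds

print(watchdog_would_report(True, 54.0))   # busy thread, 54 s delay
print(watchdog_would_report(True, 3.0))    # busy thread, under the timeout
print(watchdog_would_report(False, 0.1))   # thread that executed successfully
```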
3) Is slow movement through the threads user impacting or not user impacting?
No. Even when some threads are busy, Splunk will still try to work around them to deliver results to end users.
Busy threads are expected under heavy load. The watchdog logs may help us identify which area of Splunk is affected, but on their own and without context they do not indicate an issue.
Maybe the ERROR message should be re-worded as WARN, as it is misleading and implies there is a problem when there is not.