I am working on the "DMC Alert - Search Peer Not Responding" alert on the master node. The query it uses is as follows:
| rest splunk_server=local /services/search/distributed/peers/
| where status!="Up"
| fields peerName, host, status
| rename peerName as Instance, status as Status
The issue is that once a peer goes down, it is removed from the table that this query searches. You can see this happening under "Settings --> Distributed Search --> Search Peers". So even though the peer is down, the query returns no results and no alert is generated. Is there a fix for this?
I found another table from which a peer is not removed when it goes down: "Settings --> Distributed Management Console --> Instances (tab next to Overview)". Can anyone suggest what query I should run so that it checks the peer status in this table?
The issue I face with the default rule is that it generates many alerts even though there is no actual downtime. I do not see any corresponding logs.
There is no update in the latest version (6.4.2); it still uses the same search. When can we expect a new search?
Also, I think the alert should look for peers that have been down for more than a minute.
First off, this is a bug (SPL-90688 to be precise) which is specific to the interactions between the Distributed Management Console and Indexer Clustering.
Basically, the Cluster Master is a bit overzealous about maintaining its list of available peers and will immediately remove any peer that goes down from its manifest (sometimes referred to as the "generation"), whether the peer has been decommissioned or is the victim of an outage.
This means that the alert in question, which relies on the presence of an entity in
/services/search/distributed/peers with a status other than "Up" to detect down peers, will not work for indexer cluster peers.
In a future version, we are going to fix this problem (along with the alert in question) by differentiating between a peer that is brought down by admin intervention (which would NOT trigger this alert) and one that experiences an unplanned outage (which WOULD trigger this alert).
In the meantime, here is a different search that you can use to detect search peers of the DMC that suddenly go missing or show any status other than "Up" from the perspective of the DMC's distributed search framework:
| inputlookup dmc_assets
| rename serverName AS peerName
| fields peerName peerURI
| join type=outer peerURI [| rest splunk_server=local /services/search/distributed/peers
| rename title AS peerURI
| fields peerName peerURI status]
| eval status=if(isnull(status), "Missing", status)
| where status!="Up"
The main difference you'll note is that this search relies on the DMC's own asset table (the lookup table "dmc_assets") to get the list of expected search peers rather than on the contents of the
/services/search/distributed/peers endpoint, and simply uses the latter to read peer status if available.
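If you want to see the join logic in isolation before pointing it at real data, here is a toy sketch that swaps both data sources for mock rows built with makeresults (so it assumes a version that has that command). The peer names idx1/idx2/idx3 and their statuses are made up for illustration: the outer list stands in for dmc_assets, and the subsearch stands in for the REST endpoint, from which idx3 is deliberately absent.

```
| makeresults
| eval peerURI=split("idx1:8089,idx2:8089,idx3:8089", ",")
| mvexpand peerURI
| fields peerURI
| join type=outer peerURI
    [| makeresults
     | eval pair=split("idx1:8089=Up,idx2:8089=Down", ",")
     | mvexpand pair
     | rex field=pair "^(?<peerURI>[^=]+)=(?<status>.+)$"
     | fields peerURI status]
| eval status=if(isnull(status), "Missing", status)
| where status!="Up"
| table peerURI status
```

The intent is that idx2 comes back with status "Down" and idx3 with status "Missing", while idx1 is filtered out. The "Missing" row is the case the original alert search cannot see, because a peer that has vanished from the endpoint leaves no row for the where clause to match.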