"DMC Alert - Search Peer Not Responding" not working as expected

malhar_desai
Engager

I am working on the "DMC Alert - Search Peer Not Responding" alert on the master node. It uses the following query:

| rest splunk_server=local /services/search/distributed/peers/
| where status!="Up"
| fields peerName, host, status
| rename peerName as Instance, status as Status

The issue is that once a peer goes down, it is removed from the table that this query searches. You can see this happening under "Settings --> Distributed Search --> Search Peers". So even though the peer is down, the query returns no results and therefore no alert is generated. Is there a fix for this?

I found another table from which a peer is not removed when it goes down: "Settings --> Distributed Management Console --> Instances (tab next to Overview)". Can anyone suggest a query that checks the peer status in this table?

Securaction
Loves-to-Learn Everything

Any update on this issue?

gautham
Explorer

Hi Hexx,

The issue I face with the default rule is that it generates many alerts even though there is no actual downtime. I do not see any logs showing an outage either.

There is no update in the latest version (6.4.2); it still uses the same search. When can we expect a new search?

Also, I think the alert should fire only for peers that have been down for more than a minute.

Thank you
Gautham.

hexx
Splunk Employee

First off, this is a bug (SPL-90688 to be precise) which is specific to the interactions between the Distributed Management Console and Indexer Clustering.

Basically, the Cluster Master is overzealous about maintaining its list of available peers and immediately removes any peer that goes down from its manifest (sometimes referred to as the "generation"), whether the peer has been decommissioned or is the victim of an outage.

This means that the alert in question, which relies on the presence of an entity in /services/search/distributed/peers with a status other than "Up" to detect down peers, will not work for indexer cluster peers.

In a future version, we are going to fix this problem (along with the alert in question) by differentiating between a peer that is brought down by admin intervention (which would NOT trigger this alert) and one that experiences an unplanned outage (which WOULD trigger this alert).

In the meantime, here is a different search that you can use to detect search peers of the DMC that suddenly go missing or show any status other than "Up" from the perspective of the DMC's distributed search framework:


| localop
| inputlookup dmc_assets
| rename serverName AS peerName
| fields peerName peerURI
| join type=outer peerURI [rest splunk_server=local /services/search/distributed/peers
| rename title AS peerURI
| fields peerName peerURI status]
| eval status=if(isnull(status), "Missing", status)
| where status!="Up"

The main difference you'll note is that this search relies on the DMC's own asset table (the lookup table "dmc_assets") to get the list of expected search peers, rather than on the contents of the /services/search/distributed/peers endpoint, and simply uses the latter to read peer status when it is available.
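
If it helps to see the logic outside of SPL, here is a minimal Python sketch of what the outer join is doing. Everything in it is hypothetical: the peer names, URIs and status values are made up, and the two dictionaries merely stand in for the "dmc_assets" lookup and the /services/search/distributed/peers endpoint. It is an illustration of the idea, not how the DMC is implemented.

```python
# Hypothetical stand-in for the dmc_assets lookup: peerURI -> serverName.
expected_peers = {
    "idx1.example.com:8089": "idx1",
    "idx2.example.com:8089": "idx2",
    "idx3.example.com:8089": "idx3",
}

# Hypothetical stand-in for the distributed peers endpoint: title -> status.
# idx2 has dropped out of the list entirely, as a clustered peer would.
reported_status = {
    "idx1.example.com:8089": "Up",
    "idx3.example.com:8089": "Down",
}


def original_alert(reported):
    """Mimic the stock alert: only peers still listed can ever match."""
    return [(uri, status) for uri, status in sorted(reported.items())
            if status != "Up"]


def peers_needing_attention(expected, reported):
    """Mimic the outer join: start from the expected peers, look up each
    status, and treat an absent entry as "Missing"."""
    rows = []
    for peer_uri, peer_name in sorted(expected.items()):
        status = reported.get(peer_uri, "Missing")
        if status != "Up":
            rows.append((peer_name, peer_uri, status))
    return rows


if __name__ == "__main__":
    print(original_alert(reported_status))
    print(peers_needing_attention(expected_peers, reported_status))
```

With this sample data, the first function only sees idx3, while the second also reports idx2 as "Missing", which is the case the stock alert cannot detect.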

smudge797
Path Finder

Is this resolved? We are seeing similar false alerts in 6.4.5.
