Alerting

Why are Splunk 7.x Monitoring Console alerts frequently reporting "DMC Alert - Search Peer Not Responding"?

kinaba_splunk
Splunk Employee

Splunk 7.x Monitoring Console alerts frequently report that one of our indexers is "down" with a "DMC Alert - Search Peer Not Responding" alert. However, I can see that the Splunk process on this server is running and it has not been restarted. This looks like a false positive.

Example:

scheduler.log:06-21-2018 04:31:01.957 +1000 INFO SavedSplunker - savedsearch_id="nobody;splunk_monitoring_console;DMC Alert - Search Peer Not Responding", search_type="scheduled", user="nobody", app="splunk_monitoring_console", savedsearch_name="DMC Alert - Search Peer Not Responding", priority=default, status=success, digest_mode=1, scheduled_time=1532268780, window_time=0, dispatch_time=1532268780, run_time=0.100, result_count=1, alert_actions="email", sid="scheduler_nobody_xxxx _RMDxxxx_at_xxxxx_xxxx", suppressed=0, thread_id="AlertNotifierWorker-0"

Could you tell me why?

1 Solution

kinaba_splunk
Splunk Employee

Regarding the "DMC Alert - Search Peer Not Responding" alert: the DMC checks each search peer's status every five minutes via the REST endpoint /services/search/distributed/peers.

Unfortunately, this means the alert can be triggered not only when a peer is actually down, but also when the peer's reply does not arrive within the timeout.
When the alert fires even though the peer is up, the workarounds below can reduce these false positives. For reference, the default alert definition is:

[DMC Alert - Search Peer Not Responding]
counttype = number of events
cron_schedule = 3,8,13,18,23,28,33,38,43,48,53,58 * * * *
description = One or more of your search peers is currently down.
quantity = 0
relation = greater than
search = | rest splunk_server=local /services/search/distributed/peers/ \
| where status!="Up" AND disabled=0 \
| fields peerName, status \
| rename peerName as Instance, status as Status
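Given counttype = number of events, quantity = 0, and relation = greater than, the alert fires whenever this search returns at least one result, i.e., at least one enabled peer reports a status other than "Up" at the moment the scheduled search runs. To reproduce the trigger condition manually, here is a minimal sketch against the same endpoint:

| rest splunk_server=local /services/search/distributed/peers/
| where status!="Up" AND disabled=0
| stats count

If count is greater than 0, the alert would trigger, even when a peer is merely slow to reply rather than actually down.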

Workarounds:
There are two possible workarounds.

(1) Increase statusTimeout in distsearch.conf.
This timeout can be set longer. The only side effect is that it takes longer for a search peer to be considered down.
https://answers.splunk.com/answers/321592/dmc-alert-search-peer-not-responding-how-to-make-t.html

distsearch.conf:
statusTimeout = 
* Set connection timeout when gathering a search peer's basic info (/services/server/info).
* Note: Read/write timeouts are automatically set to twice this value.
* Defaults to 10.
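For example, to triple the default (a minimal sketch; 30 seconds is an illustrative value, tune it for your environment), set the following on the monitoring console and restart Splunk:

# $SPLUNK_HOME/etc/system/local/distsearch.conf
[distributedSearch]
# Illustrative value: 3x the 10-second default.
# Read/write timeouts are automatically set to twice this value (60 seconds here).
statusTimeout = 30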

(2) Replace the alert's existing search with the SPL below.

| rest splunk_server=local /services/search/distributed/peers/ mode=extended
| search health_status != Healthy
| fields peerName, status, status_details, health_status
| rename peerName as Instance, status as "Latest Status", status_details as "Latest Status Details", health_status as "Overall Health (last 10 mins)"

Before making the change, verify that the SPL works:
1) Go to DMC > Run a Search

| rest splunk_server=local /services/search/distributed/peers/ mode=extended
| search health_status != Healthy
| fields peerName, status, status_details, health_status
| rename peerName as Instance, status as "Latest Status", status_details as "Latest Status Details", health_status as "Overall Health (last 10 mins)"

2) Confirm that results are returned.
3) Go to /opt/splunk/etc/apps/splunk_monitoring_console/default
4) vi savedsearches.conf
5) Copy the [DMC Alert - Search Peer Not Responding] stanza.
6) Go to /opt/splunk/etc/system/local
7) vi savedsearches.conf and paste the stanza copied in step 5,
then rename it, for example to [DMC Alert - Search Peer Not Responding2].
8) Replace the "search = " value with the SPL below (the full resulting stanza is sketched after these steps).

| rest splunk_server=local /services/search/distributed/peers/ mode=extended
| search health_status != Healthy
| fields peerName, status, status_details, health_status
| rename peerName as Instance, status as "Latest Status", status_details as "Latest Status Details", health_status as "Overall Health (last 10 mins)"

9) Restart Splunk.
10) Confirm that the new alert appears under DMC > Settings > Alert Setup.
11) Change its status to Enabled.
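
Putting steps 5 through 8 together, the resulting stanza in /opt/splunk/etc/system/local/savedsearches.conf would look roughly like this (a sketch assembled from the default stanza shown above; the "2" suffix in the name is just an example):

[DMC Alert - Search Peer Not Responding2]
counttype = number of events
cron_schedule = 3,8,13,18,23,28,33,38,43,48,53,58 * * * *
description = One or more of your search peers is currently down.
quantity = 0
relation = greater than
search = | rest splunk_server=local /services/search/distributed/peers/ mode=extended \
| search health_status != Healthy \
| fields peerName, status, status_details, health_status \
| rename peerName as Instance, status as "Latest Status", status_details as "Latest Status Details", health_status as "Overall Health (last 10 mins)"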

