We have a multi-site cluster with one node at each site. Until recently we had always performed maintenance tasks like OS patches seamlessly. Whenever we do maintenance we run a `splunk stop` before anything that requires a reboot, and we have never had any problems, because all of our searches run against the remaining node.
We have recently performed two short (20-minute) maintenance windows, and each time our searches returned no data even though the data exists. We did not use maintenance mode or `splunk offline` because the windows were relatively short. This is effectively the same as a node crash or hardware failure, so our cluster is no longer giving us any real high availability at all.
I tested this afternoon with maintenance mode enabled, but searches still did not work.
What am I missing?
UPDATE: Site affinity is off and the factors should give a primary copy at each site, so everything should be searchable.
ARGH!!
I went through all the SHC nodes using the Config Quest app from Discovered Intelligence (https://splunkbase.splunk.com/app/3696/) and &%$#!(*%. I found a node configured with site1 instead of site0.
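For anyone chasing the same thing without the app, each member's effective site assignment can also be checked with btool (paths assume a default $SPLUNK_HOME):

```
# Run on each SHC member; --debug shows which .conf file
# supplies each attribute, including any "site" setting.
$SPLUNK_HOME/bin/splunk btool server list general --debug | grep -i site
```

A member with `site = site1` here, rather than `site0`, would pin its searches to site1's peers.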
With the search head / search head cluster you are testing on, is it set to site0 or a particular site?
How much time was the master given to recover the cluster's state?
It is site0. I tested yesterday and left the indexer offline for over an hour.
Does this happen regardless of which peer you take down?
Hey lycollicott,
Are the conditions for replication factor and search factor met?
You can check this in the Distributed Management Console.
When you were performing the maintenance task, was the cluster master also stopped?
The master is responsible for making the replicated buckets searchable when a peer goes down.
Refer to this link:
https://docs.splunk.com/Documentation/Splunk/7.0.2/Indexer/Whathappenswhenamasternodegoesdown
Let me know if this helps!!
The cluster master was up and the factors were met prior to the stop, so the remaining site should have been good based on the site factors.
What is the search affinity setting?
You can refer to this doc:
http://docs.splunk.com/Documentation/Splunk/7.0.2/Indexer/Multisitesearchaffinity#Implement_search_a...
It is site0
What are your sites replication and search factors?
Oh, I forgot to include that in the question.
site_replication_factor = origin:1,site1:1,site2:1,total:2
site_search_factor = origin:1,site1:1,site2:1,total:2
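For context, those factors live in the [clustering] stanza of server.conf on the cluster master. A minimal multisite sketch (the site names, available_sites value, and [general] site here are assumptions for a two-site example, not your actual config):

```
# server.conf on the cluster master
[general]
site = site1

[clustering]
mode = master
multisite = true
available_sites = site1,site2
site_replication_factor = origin:1,site1:1,site2:1,total:2
site_search_factor = origin:1,site1:1,site2:1,total:2
```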
Assuming the cluster was in a complete state when you stopped the indexer:
When you take the indexer down without running `splunk offline`, you don't give the cluster master time to redefine the primary searchable buckets. While the master is still working that out, searches can't return data from the buckets that were primary on the dead indexer, because the copies on the surviving node haven't yet been marked primary.
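For short windows like yours, the documented sequence is roughly the following (Splunk 7.x CLI, run from $SPLUNK_HOME/bin; exact behavior per the indexer clustering docs):

```
# On the cluster master: pause most bucket-fixup activity for the window
splunk enable maintenance-mode

# On the peer being patched: graceful shutdown; primaries are
# handed off to the surviving peers before the process exits
splunk offline

# ...patch and reboot; the peer rejoins the cluster on restart...

# On the cluster master: resume normal bucket fixup
splunk disable maintenance-mode
```

The `splunk offline` step is what avoids the window where the old primaries are unsearchable.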