
Our 2-node indexer cluster is no longer coping when we stop one node for maintenance.

Motivator

We have a multi-site cluster with one node at each site. Until recently we had always performed maintenance tasks like OS patches seamlessly: whenever we do maintenance we run a splunk stop before anything that requires a reboot, and we have never had a problem, because all of our searches run against the remaining node.

We have performed two short (20-minute) maintenance windows recently, and each time our searches returned no data even though the data exists. We did not use maintenance mode or the offline command because the windows were relatively quick. Since this is no different from a node crash or hardware failure, our cluster is effectively not giving us any real high availability at all.

I tested this afternoon with maintenance mode enabled, but searches still did not work.

What am I missing?

UPDATE: Site affinity is off and the factors should give a primary copy at each site, so everything should be searchable.
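
For reference, the documented approach for a planned peer restart is to enable maintenance mode on the master and take the peer down with the offline command instead of a plain stop; a minimal sketch, assuming $SPLUNK_HOME points at each install:

```shell
# On the cluster master: suspend most bucket fix-up for the window
$SPLUNK_HOME/bin/splunk enable maintenance-mode

# On the peer being patched: shut down gracefully so the master
# reassigns its primary buckets before the peer disappears
$SPLUNK_HOME/bin/splunk offline

# ...apply OS patches, reboot, then restart the peer...
$SPLUNK_HOME/bin/splunk start

# Back on the master: resume normal fix-up activity
$SPLUNK_HOME/bin/splunk disable maintenance-mode
```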

1 Solution

Motivator

ARGH!!

I went through all the SHC nodes using the Config Quest app from Discovered Intelligence (https://splunkbase.splunk.com/app/3696/) and &%$#!(*%. I found a node configured with site1 instead of site0.

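
If you want to audit this without an app, the effective site assignment can be read on each node with btool; a sketch, assuming shell access to every SHC member and peer:

```shell
# Show the effective [general] stanza of server.conf and pick out
# the site assignment (should be site0 on a no-affinity search head)
$SPLUNK_HOME/bin/splunk btool server list general | grep -i '^site'
```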


SplunkTrust

With the search head / search head cluster you are testing on, is it set to site0 or a particular site?
How much time was the master given to recover the cluster's state?

Motivator

It is site0. I tested yesterday and left the indexer offline for over an hour.


Influencer

Does this happen regardless of which of the two peers you take down?


Motivator

Hey lycollicott,

Are the conditions for the replication factor and search factor met?
You can check this in the Distributed Management Console.

When you were performing the maintenance task, was the cluster master also stopped?
The master is responsible for making the replicated buckets searchable when a peer goes down.
Refer to this link:
https://docs.splunk.com/Documentation/Splunk/7.0.2/Indexer/Whathappenswhenamasternodegoesdown

Let me know if this helps!!
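
Both factors can also be checked from the master's CLI rather than the DMC; a sketch:

```shell
# On the cluster master: report whether the replication and search
# factors are currently met, along with per-peer status
$SPLUNK_HOME/bin/splunk show cluster-status
```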


Motivator

The cluster master was up and the factors were met prior to the stop, so the remaining site should have been good based on the site factors.



Influencer

What are your sites replication and search factors?


Motivator

Oh, I forgot to include that in the question.

site_replication_factor = origin:1,site1:1,site2:1,total:2
site_search_factor = origin:1,site1:1,site2:1,total:2
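
For context, these two settings live in the [clustering] stanza of the master's server.conf; with origin:1 plus one copy per site, either site alone should still hold a searchable copy of every bucket. A sketch of the surrounding stanza (the other values here are assumed):

```ini
[clustering]
mode = master
multisite = true
available_sites = site1,site2
site_replication_factor = origin:1,site1:1,site2:1,total:2
site_search_factor = origin:1,site1:1,site2:1,total:2
```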

Influencer

Assuming that the cluster was in a complete state when you stopped the indexer:

When you take a peer down without running the offline command, you don't give the cluster master time to reassign the searchable primary buckets. While the master is still working that out, searches can't return data from buckets whose primary copies lived on the dead peer, until the master marks the surviving copies on the remaining node as primary.
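
That is why the offline command exists: it asks the master to reassign primaries (and optionally to restore the factors) before the peer actually shuts down. A sketch of the two variants:

```shell
# Graceful shutdown: wait for primaries to move to surviving peers
$SPLUNK_HOME/bin/splunk offline

# Stricter: also wait until the replication and search factors are
# met again before completing the shutdown
$SPLUNK_HOME/bin/splunk offline --enforce-counts
```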
