Deployment Architecture

Our 2-node indexer cluster is no longer coping when we stop one node for maintenance.

lycollicott
Motivator

We have a multisite cluster with one indexer at each site. Until recently we always performed maintenance tasks like OS patching seamlessly: whenever we do maintenance we run a splunk stop before anything that requires a reboot, and we have never had a problem, because all of our searches run against the remaining node.

We have performed two short (20-minute) maintenance windows recently, and each time our searches returned no data even though the data exists. We did not use maintenance mode or splunk offline because the windows were so short. This is no different from a node crash or hardware failure, so our cluster is no longer giving us any real high availability at all.
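For reference, the documented way to take a peer down cleanly is to put the master in maintenance mode and then offline the peer, rather than a bare splunk stop. A sketch of that sequence (standard Splunk CLI commands; run from $SPLUNK_HOME/bin on each box):

```
# On the cluster master, before the window starts:
splunk enable maintenance-mode

# On the peer being patched (hands off buckets, then shuts splunk down):
splunk offline

# ... do the OS patching / reboot, then bring the peer back up ...
splunk start

# Back on the master, once the peer has rejoined:
splunk disable maintenance-mode
```

Maintenance mode keeps the master from kicking off bucket fix-up activity during the window, while splunk offline gives it a chance to reassign primaries before the peer disappears.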

I tested this afternoon with maintenance mode enabled, but searches still did not work.

What am I missing?

UPDATE: Site affinity is off, and the site factors should put a primary copy at each site, so everything should remain searchable.
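(For reference, the affinity setting in question lives in server.conf on the search head; a minimal sketch of the relevant stanza:)

```
# server.conf on the search head
[general]
site = site0    # site0 disables search affinity, so the SH searches all sites
```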

1 Solution

lycollicott
Motivator

ARGH!!

I went through all the SHC nodes using the Config Quest app from Discovered Intelligence (https://splunkbase.splunk.com/app/3696/) and &%$#!(*%. I found a node set to site1.

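In case anyone else hits this, a quick way to audit the setting on each SHC member is btool (standard Splunk CLI; output trimmed):

```
splunk btool server list general --debug | grep -i site
```

The offending member reported site = site1; setting it back to site = site0 in server.conf and restarting that member fixed the searches.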



gjanders
SplunkTrust

Is the search head / search head cluster you are testing with set to site0 or to a particular site?
How much time was the master given to recover the cluster's state?

lycollicott
Motivator

It is site0. I tested yesterday and left the indexer offline for over an hour.


tiagofbmm
Influencer

Does this happen with either of the Peers you take down?


deepashri_123
Motivator

Hey lycollicott,

Are the conditions for replication factor and search factor met?
You can check this in Distributed Management Console.

When you were performing the maintenance, was the cluster master also stopped?
The master is responsible for making replicated bucket copies searchable when a peer goes down.
Refer to this link:
https://docs.splunk.com/Documentation/Splunk/7.0.2/Indexer/Whathappenswhenamasternodegoesdown

Let me know if this helps!!


lycollicott
Motivator

The cluster master was up and the factors were met prior to the stop, so the remaining site should have been good based on the site factors.



tiagofbmm
Influencer

What are your sites replication and search factors?


lycollicott
Motivator

Oh, I forgot to include that in the question.

site_replication_factor = origin:1,site1:1,site2:1,total:2
site_search_factor = origin:1,site1:1,site2:1,total:2
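Not a Splunk tool, but as a quick sanity check you can parse those factor strings and confirm that each explicit site is guaranteed a searchable copy (plain-Python sketch; parse_site_factor is a made-up helper name):

```python
def parse_site_factor(spec):
    """Turn 'origin:1,site1:1,site2:1,total:2' into a dict of ints."""
    return {key: int(val) for key, val in
            (part.split(":") for part in spec.split(","))}

search_factor = parse_site_factor("origin:1,site1:1,site2:1,total:2")

# Every explicit site keeps at least one searchable copy, so either site
# alone should be able to serve all searches once primaries are reassigned.
assert all(search_factor[site] >= 1 for site in ("site1", "site2"))
print(search_factor)
```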

tiagofbmm
Influencer

Assuming the cluster was in a complete state when you stopped the indexer:

When you take a peer down without running splunk offline, you don't give the cluster master time to reassign the searchable primary buckets. While the master is still working that out, searches can miss the buckets that were primary on the dead indexer, until the copies on the surviving node are marked primary.
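One way to watch that fix-up happen is from the master (standard Splunk CLI; exact wording of the output varies by version):

```
splunk show cluster-status
```

It lists each peer and whether the search and replication factors are currently met; wait until the cluster reports all data searchable before judging the search results.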
