We have a multi-site cluster with one node at each site. Until recently we had always performed maintenance tasks like OS patches seamlessly. Whenever we do maintenance we run a `splunk stop` before anything that requires a reboot, and we have never had any problems, because all of our searches run against the remaining node.
We have recently performed two short (20-minute) maintenance windows, and each time our searches returned no data even though the data exists. We did not use maintenance mode or `splunk offline` because the windows were relatively short. This is effectively the same as a node crash or hardware failure, so our cluster is no longer giving us any real high availability at all.
I tested this afternoon with maintenance mode enabled, but searches still did not work.
What am I missing?
UPDATE: Site affinity is off and the factors should give a primary copy at each site, so everything should be searchable.
ARGH!!
I went through all the SHC nodes using the Config Quest app from Discovered Intelligence (https://splunkbase.splunk.com/app/3696/) and &%$#!(*%. I found a node configured with site1 instead of site0.
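For anyone chasing the same thing without the app, each member's effective site assignment can also be checked with btool (paths assume a default $SPLUNK_HOME):

```
# Run on each SHC member; --debug shows which .conf file
# supplies each attribute, including any "site" setting.
$SPLUNK_HOME/bin/splunk btool server list general --debug | grep -i site
```

A member with `site = site1` here, rather than `site0`, would pin its searches to site1's peers.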
With the search head / search head cluster you are testing on, is it set to site0 or a particular site?
How much time was the master given to recover the cluster's state?
It is site0. I tested yesterday and left the indexer offline for over an hour.
Does this happen regardless of which peer you take down?
Hey lycollicott,
Are the conditions for replication factor and search factor met?
You can check this in the Distributed Management Console.
When you were performing the maintenance task, was the cluster master also stopped?
The master is responsible for making the replicated buckets searchable when a peer goes down.
Refer to this link:
https://docs.splunk.com/Documentation/Splunk/7.0.2/Indexer/Whathappenswhenamasternodegoesdown
Let me know if this helps!!
The cluster master was up and the factors were met prior to the stop, so the remaining site should have been good based on the site factors.
What is the search affinity setting?
You can refer to this doc:
http://docs.splunk.com/Documentation/Splunk/7.0.2/Indexer/Multisitesearchaffinity#Implement_search_a...
It is site0
What are your sites replication and search factors?
Oh, I forgot to include that in the question.
site_replication_factor = origin:1,site1:1,site2:1,total:2
site_search_factor = origin:1,site1:1,site2:1,total:2
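For context, those factors live in the [clustering] stanza of server.conf on the cluster master. A minimal multisite sketch (the site names, available_sites value, and [general] site here are assumptions for a two-site example, not your actual config):

```
# server.conf on the cluster master
[general]
site = site1

[clustering]
mode = master
multisite = true
available_sites = site1,site2
site_replication_factor = origin:1,site1:1,site2:1,total:2
site_search_factor = origin:1,site1:1,site2:1,total:2
```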
Assuming the cluster was in a complete state when you stopped the indexer:
When you take the indexer down without running `splunk offline`, you don't give the cluster master time to redefine the primary searchable buckets. While the master is still working that out, searches can't return data from the buckets that were primary on the dead indexer, because the copies on the surviving node haven't yet been marked primary.
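For short windows like yours, the documented sequence is roughly the following (Splunk 7.x CLI, run from $SPLUNK_HOME/bin; exact behavior per the indexer clustering docs):

```
# On the cluster master: pause most bucket-fixup activity for the window
splunk enable maintenance-mode

# On the peer being patched: graceful shutdown; primaries are
# handed off to the surviving peers before the process exits
splunk offline

# ...patch and reboot; the peer rejoins the cluster on restart...

# On the cluster master: resume normal bucket fixup
splunk disable maintenance-mode
```

The `splunk offline` step is what avoids the window where the old primaries are unsearchable.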