Hard Disk Failure on One Indexer in a Cluster

mark_wymer
Path Finder

Hi all,

Our environment includes, among other things, a three-site multisite indexer cluster. Each site has three indexers, for a total of nine, and the replication factor is 3. On each indexer, the hot/warm and cold buckets are on separate filesystems.

On one of the indexers, the filesystem containing the cold buckets suffered a hard disk failure that destroyed the entire filesystem.

My question is: when the disk/filesystem is repaired, will Splunk automatically rebuild the cold buckets from the replicated copies? If it does, will it do so when I start Splunk, or are there maintenance commands that I will need to issue?

Many thanks,
Mark.

nickhills
Ultra Champion

Hi Mark,

Once the filesystem is back (assuming it's just the index filesystem) and you can boot the peer as normal, it should rejoin the cluster.
When it joins, it will share its list of cold buckets (none) with the CM.
The CM will take any steps necessary to bring the cluster back to health. However, if the cluster has already become consistent (using the remaining 8 hosts), nothing will need to be replicated.

This means that your restored peer will initially have no cold buckets. That is fine from a cluster health perspective, but it does mean the indexer will not "pull its weight" for searches that include data in those cold buckets.
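To make the difference between "healthy" and "evenly spread" concrete, here is a toy model in Python. It is only a sketch and not Splunk's actual placement logic: it ignores site boundaries, treats every bucket copy as equal, and the peer names (`idx1` to `idx9`), the bucket count, and the helper functions are all made up for illustration.

```python
import random
from collections import Counter

REPLICATION_FACTOR = 3
PEERS = [f"idx{i}" for i in range(1, 10)]  # nine hypothetical indexers, sites ignored


def place_buckets(n_buckets, peers, rf, rng):
    """Give each bucket rf copies on distinct, randomly chosen peers."""
    return {b: set(rng.sample(peers, rf)) for b in range(n_buckets)}


def copies_per_peer(placement, peers):
    """Count how many bucket copies each peer holds (0 if it holds none)."""
    counts = Counter({p: 0 for p in peers})
    for holders in placement.values():
        counts.update(holders)
    return counts


def fix_up(placement, live_peers, rf):
    """Re-create missing copies on live peers until every bucket has rf copies."""
    counts = copies_per_peer(placement, live_peers)
    for holders in placement.values():
        while len(holders) < rf:
            candidates = (p for p in live_peers if p not in holders)
            target = min(candidates, key=counts.__getitem__)
            holders.add(target)
            counts[target] += 1


def rebalance(placement, peers):
    """Move copies from the fullest peer to the emptiest until counts differ by at most 1."""
    counts = copies_per_peer(placement, peers)
    while True:
        donor = max(peers, key=counts.__getitem__)
        receiver = min(peers, key=counts.__getitem__)
        if counts[donor] - counts[receiver] <= 1:
            return
        # Pick any bucket the donor holds and the receiver does not.
        bucket = next(b for b, h in placement.items()
                      if donor in h and receiver not in h)
        placement[bucket].remove(donor)
        placement[bucket].add(receiver)
        counts[donor] -= 1
        counts[receiver] += 1


rng = random.Random(0)
placement = place_buckets(300, PEERS, REPLICATION_FACTOR, rng)

# The failed peer loses every cold bucket copy it held.
failed = "idx4"
for holders in placement.values():
    holders.discard(failed)

# The cluster repairs itself using only the eight surviving peers.
survivors = [p for p in PEERS if p != failed]
fix_up(placement, survivors, REPLICATION_FACTOR)
fully_replicated = all(len(h) == REPLICATION_FACTOR for h in placement.values())
after_fixup = copies_per_peer(placement, PEERS)

# The restored peer rejoins empty; only a rebalance gives it a share again.
rebalance(placement, PEERS)
after_rebalance = copies_per_peer(placement, PEERS)

print("replication factor met after fix-up:", fully_replicated)
print("copies on restored peer before rebalance:", after_fixup[failed])
print("copies on restored peer after rebalance:", after_rebalance[failed])
```

In this model the cluster meets its replication factor again while the restored peer still holds zero copies, which is why nothing is copied back on its own. Only the explicit rebalance step evens out the counts.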

To restore an even distribution of buckets across all peers (recommended for optimum performance and fault tolerance), you should rebalance the cluster, which will copy buckets to that host from the surviving 8 peers.

https://docs.splunk.com/Documentation/Splunk/8.0.1/Indexer/Rebalancethecluster
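As a rough sketch of what the rebalance looks like in practice, the small Python helper below builds the `splunk rebalance cluster-data` CLI calls covered on that docs page (`-action start`, `status`, and `stop`, with the optional `-index` and `-max_runtime` arguments to `start`). The install path `/opt/splunk`, the function names, and the 60-minute limit are assumptions for illustration. The commands are meant to be run on the CM, and by default the helper only prints them rather than executing anything.

```python
import shlex
import subprocess

SPLUNK_BIN = "/opt/splunk/bin/splunk"  # assumption: default install path, adjust to your $SPLUNK_HOME
ACTIONS = {"start", "status", "stop"}


def rebalance_command(action, index=None, max_runtime_minutes=None):
    """Build the argv for `splunk rebalance cluster-data -action <action>`."""
    if action not in ACTIONS:
        raise ValueError(f"action must be one of {sorted(ACTIONS)}")
    argv = [SPLUNK_BIN, "rebalance", "cluster-data", "-action", action]
    if action == "start":
        if index is not None:
            argv += ["-index", index]  # limit the rebalance to a single index
        if max_runtime_minutes is not None:
            argv += ["-max_runtime", str(max_runtime_minutes)]  # stop after N minutes
    return argv


def run(argv, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(shlex.quote(a) for a in argv))
    if not dry_run:
        subprocess.run(argv, check=True)


# Dry run: show the commands you would issue on the CM.
run(rebalance_command("start", max_runtime_minutes=60))
run(rebalance_command("status"))
```

Calling `run(..., dry_run=False)` would actually invoke the CLI, which may prompt for Splunk credentials. Check the argument names against the docs for your Splunk version before relying on this.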

If my comment helps, please give it a thumbs up!

mark_wymer
Path Finder

Thanks for the answer / confirmation Nick.
