Hard Disk Failure on One Indexer in a Cluster

mark_wymer
Path Finder

Hi all,

Our environment includes, amongst other things, a multisite indexer cluster with three sites. Each site has three indexers, making a total of nine indexers. We also have a replication factor of 3. On each indexer, the hot/warm and cold buckets are on separate filesystems.

On one of the indexers, the filesystem containing the cold buckets suffered a hard disk failure, which destroyed the entire filesystem.

My question is: when the disk/filesystem is repaired, will Splunk automatically rebuild the cold buckets from the replicated copies? If it does, will it do so when I start Splunk, or are there maintenance commands that I will need to issue?

Many thanks,
Mark.

nickhills
Ultra Champion

Hi Mark,

Once the filesystem is back (assuming it's just the index filesystem) and you can boot the peer as normal, it should rejoin the cluster.
When it joins, it will share its list of cold buckets (none) with the CM.
The CM will take any steps necessary to bring the cluster back to health. However, if the cluster has already become consistent (using the remaining 8 hosts), nothing will need to be replicated.
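
To make that concrete, here is a rough Python sketch of the idea. It is a toy model, not Splunk's actual fix-up logic: the peer names, bucket IDs, and the `fix_up` helper are all made up for illustration, and it ignores multisite placement rules and the search factor.

```python
# Toy model only: real clusters also honour site and search factors.
REPLICATION_FACTOR = 3  # value taken from the original post

def fix_up(copies, live_peers, rf=REPLICATION_FACTOR):
    """Top up each bucket to rf copies using only the peers that are alive."""
    for holders in copies.values():
        holders &= live_peers                  # copies on a failed peer are gone
        spare = sorted(live_peers - holders)   # peers that could take a new copy
        while len(holders) < rf and spare:
            holders.add(spare.pop(0))
    return copies

peers = {f"idx{n}" for n in range(1, 10)}      # nine hypothetical indexers
names = sorted(peers)
# Twelve hypothetical buckets, each starting with three copies.
copies = {b: {names[(b + k) % 9] for k in range(3)} for b in range(12)}

survivors = peers - {"idx9"}                   # idx9 loses its cold filesystem
fix_up(copies, survivors)                      # the other eight restore the copies

# idx9 rejoins empty. Every bucket already meets the replication factor,
# so a second pass has nothing to add.
before = {b: set(h) for b, h in copies.items()}
fix_up(copies, peers)
on_idx9 = sum("idx9" in h for h in copies.values())
print(on_idx9, copies == before)
```

In this model the second pass changes nothing, which is the situation described above: the cluster is healthy, but the restored peer holds no copies.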

This means that your restored peer will initially have few, if any, cold buckets. This is fine from a cluster health perspective, but it does mean that the indexer will not "pull its weight" for searches that include data in those cold buckets.

To restore an even distribution of buckets across all peers (recommended for optimum performance and fault tolerance), you should rebalance the cluster, which will copy buckets to that host from the surviving 8 peers.

https://docs.splunk.com/Documentation/Splunk/8.0.1/Indexer/Rebalancethecluster
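
As a rough illustration of what a rebalance achieves, here is a small Python sketch. It is a simplified model under my own assumptions, not Splunk's rebalancing algorithm: the peer names, bucket counts, and the `rebalance` function are hypothetical, and site constraints are ignored.

```python
from collections import Counter

def rebalance(copies, peers):
    """Move copies from the fullest peer to the emptiest until counts differ by at most 1."""
    while True:
        load = Counter({p: 0 for p in peers})
        for holders in copies.values():
            load.update(holders)
        fullest = max(peers, key=lambda p: load[p])
        emptiest = min(peers, key=lambda p: load[p])
        if load[fullest] - load[emptiest] <= 1:
            return load
        # Move one copy that the emptiest peer does not already hold.
        for holders in copies.values():
            if fullest in holders and emptiest not in holders:
                holders.remove(fullest)
                holders.add(emptiest)
                break
        else:
            return load  # nothing movable; stop rather than loop forever

peers = [f"idx{n}" for n in range(1, 10)]   # nine hypothetical indexers
survivors = peers[:8]                       # idx9 has just come back empty
# 24 hypothetical buckets, three copies each, all on the eight survivors.
copies = {b: {survivors[(b + k) % 8] for k in range(3)} for b in range(24)}

load = rebalance(copies, peers)
print(load["idx9"])
```

Each bucket keeps its three copies throughout; only their placement changes. On a real cluster you would not script this yourself: the linked page describes starting the rebalance from the cluster manager, for example with `splunk rebalance cluster-data -action start`.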

If my comment helps, please give it a thumbs up!

mark_wymer
Path Finder

Thanks for the answer / confirmation Nick.
