Solved: How do I rebalance an index cluster with storage discrepancy between members?

sndblstr · 09-30-2022

Hello everyone,

I am fairly new to Splunk and learning on the fly, so it would be super nice if someone could help me solve this issue or guide me on how to deal with it for now.

We had an indexer cluster of 5 nodes in 2 sites.
Site 1 has 2 nodes with 9.1 TB each and site 2 has 3 nodes with 20 TB each... don't ask me why. There was some confusion with the initial installation of the machines.
So the disks in site 1 are almost full (98% of storage used), while the machines in site 2 are at around 50%.
Yesterday we added another instance to site 1 with 20 TB of disk, but it does not seem to offload the other 2 in site 1.
What are our options here? Shall we run index rebalancing from the manager node?
Any guidance will be much appreciated.

Regards
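
For reference, data rebalancing is started from the manager node with the splunk rebalance cluster-data command. Below is a minimal sketch, not a recommendation for this particular cluster. The rebalance wrapper and its DRY_RUN switch are made up for this example so that nothing runs by accident, and SPLUNK_HOME is assumed to default to /opt/splunk. As far as I know, rebalancing evens out the number of bucket copies per peer rather than the disk space used, so on its own it may not relieve peers that are nearly full.

```shell
#!/bin/sh
# Assumption: Splunk is installed under /opt/splunk unless SPLUNK_HOME says otherwise.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"

# Hypothetical helper: prints the command by default (DRY_RUN=1),
# and only runs it on the manager node when DRY_RUN=0.
rebalance() {
    # $1 is the action: start, status, or stop
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$SPLUNK_HOME/bin/splunk rebalance cluster-data -action $1"
    else
        "$SPLUNK_HOME/bin/splunk" rebalance cluster-data -action "$1"
    fi
}

rebalance start     # dry run: shows the command that starts a rebalance
rebalance status    # dry run: shows the command that reports progress
```

Setting DRY_RUN=0 makes the helper actually call the Splunk CLI, which will ask for credentials.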

sndblstr · 09-30-2022

Hello isoutamo,

Thank you for your answer. We have increased the disks of the two nodes in site1 to 20 TB, but the management still sees them as 9 TB. We haven't restarted the Splunk services on them. Shall we?

Also the new one that we added does not seem to be working:

"Unable to distribute to peer named new-server at uri=new-server:8089 using the uri-scheme=https because peer has status=Down. Verify uri-scheme, connectivity to the search peer, that the search peer is up, and that an adequate level of system resources are available. See the Troubleshooting Manual for more information."

I can confirm that the service is running and listening on port 8089 and also reachable using telnet.

To answer your question about the replication factor: it is origin:1, site1:1, site2:1, total:2

Regards

isoutamo · 09-30-2022

What exactly do you mean by "increased the disks of the two nodes in site1 to 20tb"? Did you just add a disk to the node and add it to the filesystem that Splunk uses for its SPLUNK_DB? If you have volumes defined in indexes.conf (and you should), have you also increased their size on the CM and then deployed the new value to the indexers? Or how have you managed this, given that in your original setup those nodes had different-sized volumes?

When you say "almost full disk", how did you measure it? From Splunk, or from the command line with df/vgs or whatever tool you are using?

Anyhow, in a cluster you should (read: must) have equal-sized disks on all nodes. Otherwise you will run into issues sooner or later. So my proposal is that you add disk space to all the nodes that have only 9.1 TB instead of 20 TB, so that all of them have 20 TB that Splunk can use. You should also use Splunk volumes to cap the maximum amount of disk space in use, to avoid an "out of space" situation. All of this must be controlled and managed from the CM, not on individual hosts!
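
To illustrate the volume advice above, here is a rough sketch of what a volume definition in indexes.conf looks like, printed by a small helper. The helper name volume_stanza, the volume name "primary", the path and the size are all placeholders chosen for this example, not values from this cluster. maxVolumeDataSizeMB is the setting that caps how much disk the volume may use.

```shell
#!/bin/sh
# Hypothetical helper: print a volume stanza for indexes.conf.
volume_stanza() {
    # $1 = volume name, $2 = filesystem path, $3 = size cap in MB
    printf '[volume:%s]\npath = %s\nmaxVolumeDataSizeMB = %s\n' "$1" "$2" "$3"
}

# Placeholder values: a volume called "primary" capped at roughly 17 TB.
volume_stanza primary /opt/splunk/var/lib/splunk 17000000
```

An index then refers to the volume in its paths, for example homePath = volume:primary/<index name>/db, and the stanza is distributed from the CM so that every peer gets the same cap.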

I also suggest using Linux LVM for the Splunk filesystems, even if you are running Splunk in the cloud or on another IaaS service.

sndblstr · 09-30-2022

Hello,

What happened is that the disk itself was 20 TB, but the colleagues who set it up at the beginning somehow created /dev/sda4, where /opt/splunk is mounted, with only 9 TB, so we have now increased /dev/sda4 to 20 TB.
We measure it with df -h

LVM was not used in this case, unfortunately.

isoutamo · 09-30-2022

Ok. I suppose that you have restarted Splunk after the resize was done?

Next, check whether your Splunk is configured to use volumes in indexes.conf. You can check this by looking at any index stanza to see whether any path contains volume:<something> (there should be tstatsHomePath = volume:_splunk_summaries/...., but you should look for others in homePath and/or coldPath). If there are, then try to find the volume definitions, e.g.

splunk btool indexes list volume:<your volume name> --debug

That shows the file where it is configured (run this on any indexer node). The path should be something like /opt/splunk/etc/slave-apps/.... Then look on the CM in /opt/splunk/etc/master-apps/<path from previous command> and change the volume size to the correct value. It should be something like the total size reported by df -BM /<path>/to/your/splunk DB dir, minus 10-20% of that total.

After that, deploy it to the cluster. If that definition has been made somewhere other than slave-apps/peer-apps on your indexers, then I suggest asking for help from Splunk PS or a local Splunk Partner.
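
The "total size minus 10-20%" rule above can be sketched as a small calculation. This is only an illustration: the helper names fs_total_mb and suggest_volume_mb are made up here, and the 15% headroom is just one value inside the suggested range.

```shell
#!/bin/sh
# Hypothetical helper: total size in MB of the filesystem that holds the given path.
fs_total_mb() {
    # df -Pk prints sizes in 1024-byte blocks on a single line per filesystem
    df -Pk "$1" | awk 'NR==2 {print int($2/1024)}'
}

# Hypothetical helper: total size minus a headroom percentage.
suggest_volume_mb() {
    # $1 = total size in MB, $2 = headroom in percent
    echo $(( $1 * (100 - $2) / 100 ))
}

# Worked example with a fixed figure so the arithmetic is visible
suggest_volume_mb 1000000 15    # prints 850000
```

On an indexer, suggest_volume_mb "$(fs_total_mb /opt/splunk)" 15 would give a candidate value for the volume's maxVolumeDataSizeMB, assuming the index data lives on the filesystem mounted at /opt/splunk as described earlier in the thread.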

isoutamo · 09-30-2022

Hi

First, what are your site replication and site search factors? Those define how data/buckets are shared between sites. Also, how are your UFs/HFs connecting to this cluster?

Here are a couple of old answers which are somewhat related to your issue:

r. Ismo
