Solved: RP / SF not OK after decommissioning 1 indexer - Multi site indexer clustering

veryfoot · ‎01-21-2024

Hi all, I'm running Splunk version 9.0.2.

After decommissioning one indexer in a multisite cluster, I can't get my SF / RP back.

A rolling restart and a CM restart (splunkd) had no effect.

I have 3 SF tasks pending, all with the same message:

Missing enough suitable candidates to create a replicated copy in order to meet replication policy. Missing={ site2:1 }

I have tried Resync and Roll on the bucket, with no success.

In the details of the pending task, I can see that the bucket is only on one indexer, and is not searchable on the other indexers of the cluster.

My SF = 2 and RF = 2.

I'd like to have a clean state before decommissioning the next indexer.

Any advice or help would be highly appreciated to get my SF / RP back (it is a production issue).

Thanks in advance.

veryfoot · ‎01-28-2024

Hi Splunkers,

The origin of the problem was corrupted buckets.

In my case, 3 buckets were corrupted. This is what happens when analysts push bad search requests that kill the splunkd daemon on some running indexers while another indexer is being decommissioned.

Check : https://docs.splunk.com/Documentation/Splunk/Latest/Troubleshooting/CommandlinetoolsforusewithSuppor...

I used this command (run on the indexer where the bucket is; that indexer has to be stopped too):

>>> splunk fsck repair [bucket_path] [index]

(use "find /indexes/path | grep bucket_uid$ | grep [index's bucket]" to find its path)
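For readers who prefer a script to the find/grep pipeline, here is a minimal Python sketch of the same lookup. It walks an index root and returns every directory whose name ends with a given bucket identifier. The function name `find_bucket_dirs` and the directory names in the demo are made up for illustration. Unlike the grep pipeline, it matches directories only, which is an assumption about what you are looking for.

```python
import os
import tempfile


def find_bucket_dirs(index_root, bucket_suffix):
    """Return directories under index_root whose name ends with bucket_suffix."""
    matches = []
    for dirpath, dirnames, _filenames in os.walk(index_root):
        for name in dirnames:
            if name.endswith(bucket_suffix):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # Demo on a throwaway tree; these directory names are hypothetical.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "myindex", "db", "db_200_100_42_EXAMPLE-GUID"))
        os.makedirs(os.path.join(root, "myindex", "db", "db_400_300_43_EXAMPLE-GUID"))
        for path in find_bucket_dirs(root, "_42_EXAMPLE-GUID"):
            print(path)
```

On a real indexer you would pass your own index path and the bucket identifier shown in the pending task details.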

That fsck confirmed the problem.

In my case, the problem was not repairable.

The data was old and very small, so the decision was made to delete these buckets.

After that, everything went back to normal.

Problem solved.

Thanks for the help

PickleRick · ‎01-21-2024

And what are your site RF/SF settings and how many indexers do you have in each site?

veryfoot · ‎01-21-2024

Hi and thanks for the reply.

"And what are your site RF/SF settings": can you be more specific, please? In server.conf on my CM? (I will check that when I'm back at work tomorrow.)

For the site details:

2 sites with 18 indexers in total: 9 on one site, and 8 + 1 decommissioned on the other site.

I'll get back to you tomorrow morning.

Regards,

PickleRick · ‎01-21-2024

Yep.

Check the output of

splunk btool server list clustering | grep factor

veryfoot · ‎01-22-2024

splunk btool server list clustering | grep factor

Hi, thanks, here is the output:

etc/system/default/server.conf >>> ack_factor = 0

etc/apps/MULTI_SITE_APP/local/server.conf >>> replication_factor = 2

etc/apps/MULTI_SITE_APP/local/server.conf >>> search_factor = 2

etc/apps/MULTI_SITE_APP/local/server.conf >>> site_replication_factor = origin:1, site1:1, site2:1, total:2

etc/apps/MULTI_SITE_APP/local/server.conf >>> site_search_factor = origin:1, site1:1, site2:1, total:2

etc/system/default/server.conf >>> replication_factor = 3
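To relate these settings to the pending-task message, here is a rough Python sketch. It assumes the site factor is written in the comma-separated `key:count` form, and it only compares the explicitly listed sites against the copies a bucket currently has. Both function names are hypothetical, and this is a simplified illustration, not how the cluster manager actually evaluates the policy (it ignores `origin` and `total`).

```python
def parse_site_factor(value):
    """Parse 'origin:1,site1:1,site2:1,total:2' into a dict of int counts."""
    counts = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        key, _, number = part.partition(":")
        counts[key.strip()] = int(number)
    return counts


def sites_below_policy(policy, copies_by_site):
    """Return {site: shortfall} for explicitly listed sites with too few copies."""
    shortfall = {}
    for key, required in policy.items():
        if key in ("origin", "total"):
            continue  # simplification: only per-site minimums are checked
        have = copies_by_site.get(key, 0)
        if have < required:
            shortfall[key] = required - have
    return shortfall


if __name__ == "__main__":
    policy = parse_site_factor("origin:1, site1:1, site2:1, total:2")
    # A bucket with a single copy on site1 and none on site2.
    print(sites_below_policy(policy, {"site1": 1}))
```

With one copy on site1 only, the sketch reports a shortfall on site2, which has the same shape as the `Missing={ site2:1 }` message in the first post.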

Regards,

PickleRick · ‎01-22-2024

OK. Looks relatively good.

Try to run

| rest splunk_server=<your_cluster_manager> /services/cluster/manager/peers
| table label site status

from your MC

veryfoot · ‎01-22-2024

I ran the command from the search bar of my Monitoring Console.

I have all my 17 indexers "Up" with the right site distribution.

I don't see the decommissioned indexer, which no longer has splunkd running (splunkd has been disabled).

Thanks

PickleRick · ‎01-22-2024

If your cluster has been recently migrated from single site to multisite there might be issues with "dangling" non-multisite buckets especially if you have constrain_singlesite_buckets=true.

Restarting CM _might_ resolve your issue, but there's no guarantee.

In case it doesn't it's probably a case for support.

veryfoot · ‎01-22-2024

Hi and thanks for the support.

>>> If your cluster has been recently migrated from single site to multisite : NO

>>> Restarting CM _might_ resolve your issue : Already done (CM only and rolling restart with no effect for RP/SF)

The operation was to add new indexers and then decommission the old ones.

It has been multisite since everything was built.

But I have "constrain_singlesite_buckets=true" on the CM and on the indexers, in etc/system/default/server.conf.

Maybe these are buckets from the beginning of my infrastructure, from a time when the multisite cluster was not yet built and operational?

Do you know the impact of changing constrain_singlesite_buckets to false?

Many thanks

PickleRick · ‎01-22-2024

I haven't personally done it, but these docs describe migrating buckets to multisite.

https://docs.splunk.com/Documentation/Splunk/9.1.2/Indexer/Migratetomultisite#How_the_cluster_migrat...

veryfoot · ‎01-22-2024

Thanks,

I was reading the same page ^^

I'll keep you updated. I just want to verify before pushing the modification to the CM's server.conf (only) and restarting the CM daemon:

[clustering]
mode = manager
constrain_singlesite_buckets = false

Do you know how to perform this step from the docs:

To see how many buckets will require conversion to multisite, use

services/cluster/manager/buckets?filter=multisite_bucket=false&filter=standalone=false

before changing the manager node configuration.

Thanks

PickleRick · ‎01-22-2024

You can either use the "splunk _internal call" command on the cmdline or use

| rest /services/cluster/manager/buckets
| where multisite_bucket=0 AND standalone=0

(or "false" instead of 0, I'm not sure here)
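If you would rather call the REST endpoint directly (for example with curl or a small script), this Python sketch builds the URL using the filters quoted from the docs earlier in the thread. The function name `buckets_url` and the host `cm.example.com` are placeholders. Port 8089 is assumed to be the management port, and `output_mode=json` is added for easier parsing. The request itself needs valid credentials and is not shown here.

```python
from urllib.parse import urlencode


def buckets_url(manager_host, port=8089):
    """Build the cluster manager URL listing buckets not yet converted to multisite."""
    query = urlencode([
        ("filter", "multisite_bucket=false"),
        ("filter", "standalone=false"),
        ("output_mode", "json"),
    ])
    return f"https://{manager_host}:{port}/services/cluster/manager/buckets?{query}"


if __name__ == "__main__":
    # "cm.example.com" is a placeholder for your cluster manager.
    print(buckets_url("cm.example.com"))
```

Note that `urlencode` percent-encodes the `=` inside each filter value; the server is expected to decode it back, but that is an assumption worth checking against your own CM.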

veryfoot · ‎01-22-2024

The modification has been made on the CM.

Added to etc/system/local/server.conf (followed by a CM restart):

"constrain_singlesite_buckets = false"

No change.

Performed another rolling restart.

No change.

I still have pending tasks that are impossible to resync.

Any suggestions for finding where the problem comes from?

Many thanks

veryfoot · ‎01-22-2024

Thanks for the reply.

The rest command is not working for me... I get error messages in the CM search for all my peer indexers, like:

[Peer1,Peer2,....] HTTP status not OK, Code=503, Service Unavailable

The web service is stopped on all my indexers... Is there any other way to get this info?

Thanks

PickleRick · ‎01-22-2024

Sorry, I should have mentioned what was pretty obvious to me: you should run the rest command on the MC. A properly configured MC should have access to all your components. But the REST call should be made from the MC _against_ your CM.

But still if you're stuck in that "no candidates" state, I'd suggest opening a support case.

veryfoot · ‎01-22-2024

Hi,

| rest /services/cluster/manager/buckets
| where multisite_bucket=0 AND standalone=0

Run from the MC, this gives me the same error messages.

Port 8089 has been tested from the MC to the indexer(s) and is open.

I think it is because the web service is off on my indexers... but I have no time to dig into this right now.

The priority is to get SF / RP back to green.

One thing I have noticed: when I try to resync one of these 3 pending tasks, the drop-down menu for choosing where to resync the bucket only lists indexers from the bucket's origin site, plus a single indexer from the other site. So I can't force replication to another indexer on the other site...

The details of the pending task give me this info for the bucket:

----

Replication count by site :

site 1:1

site 2:7

Search count by site

site 1:1

---

Is there a way to force replication to a chosen indexer on the other site? (Splunk only lists one indexer on the other site, and always the same one.)

Thanks !

veryfoot · ‎01-22-2024

Thanks for all your help and advice.

I will try the rest command on the MC as you suggest tomorrow; I'm back home now. Normally the MC is well configured.

I will post an update after investigating.

But I agree with you, I think I will have to open a case with Splunk 😞

Best regards

veryfoot · ‎01-22-2024

One thing I have noticed: when I try to resync one of these 3 pending tasks, the drop-down menu for choosing where to resync the bucket only lists indexers from the bucket's origin site, plus a single indexer from the other site. So I can't force replication to another indexer on the other site...

"View Bucket Details" in the details of the pending task gives me this info for the bucket:

----

Replication count by site :

site 1:1

site 2:7

Search count by site

site 1:1

---

Is there a way to force replication to a chosen indexer on the other site? (Splunk only lists one indexer on the other site, and always the same one.)

Thanks !

