Deployment Architecture

RF / SF not met after decommissioning one indexer - multisite indexer clustering

veryfoot
Path Finder

Hi all, I'm running Splunk version 9.0.2.

After decommissioning one indexer in a multisite cluster, I can't get my SF / RF back to a met state.

A rolling restart and a CM restart (splunkd) had no effect.

I have 3 SF tasks pending, all with the same message:

Missing enough suitable candidates to create a replicated copy in order to meet replication policy. Missing={ site2:1 }

I have tried to resync and roll the buckets, with no success.

In the details of the pending tasks, I can see that the bucket is on only one indexer and is not searchable on the other indexers of the cluster.

My SF = 2 and RF = 2.

I'd like things to be clean before decommissioning the next indexer.

Any advice or help would be highly appreciated to restore my SF / RF (it is a production issue).

Thanks in advance


veryfoot
Path Finder

Hi Splunkers,

The origin of the problem was corrupted buckets.

In my case, 3 buckets were corrupted. This is what happens when analysts run bad searches that kill the splunkd daemon on some running indexers while another indexer is being decommissioned.

Check : https://docs.splunk.com/Documentation/Splunk/Latest/Troubleshooting/CommandlinetoolsforusewithSuppor...

I used this command (run on the indexer that holds the bucket; that indexer has to be stopped too):

>>> splunk fsck repair [bucket_path] [index]

(use something like "find /indexes/path | grep bucket_uid$ | grep [index's bucket]" to find its path)
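As a rough sketch of that lookup (not the exact command used here): the helper below assumes the bucket ID shown by the cluster manager has the form <index>~<local_id>~<guid>, and that warm/cold bucket directories are named db_<newest>_<oldest>_<local_id>_<guid> for the originating copy or rb_<newest>_<oldest>_<local_id>_<guid> for a replicated copy. The function name find_bucket_dir is made up for illustration, and hot buckets are not covered.

```shell
# Hypothetical helper: print the directory of a clustered bucket.
# Usage: find_bucket_dir <index_root> '<index>~<local_id>~<guid>'
# Assumption: directories are named db_*_<local_id>_<guid> or rb_*_<local_id>_<guid>.
find_bucket_dir() (
  index_root=$1
  bucket_id=$2
  rest=${bucket_id#*"~"}       # drop the leading "<index>~"
  local_id=${rest%%"~"*}       # text before the next "~"
  guid=${rest#*"~"}            # text after it
  find "$index_root" -type d \
    \( -name "db_*_${local_id}_${guid}" -o -name "rb_*_${local_id}_${guid}" \)
)
```

For example, find_bucket_dir /indexes/path 'main~42~<guid>' would list any matching db_ or rb_ directory under that root; the resulting path is what gets passed to fsck.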

The fsck confirmed the problem.

In my case, the buckets could not be repaired.

The data was old and very small, so the decision was made to delete these buckets.

After that, everything went back to normal.

Problem solved.

Thanks for the help


PickleRick
SplunkTrust

And what are your site RF/SF settings and how many indexers do you have in each site?


veryfoot
Path Finder

Hi and thanks for the reply.

And what are your site RF/SF > can you be more specific, please? In the server.conf on my CM? (I will check that when I'm back at work tomorrow.)

For the site details:

2 sites with 18 indexers in total: 9 on one site, and 8 + 1 decommissioned on the other site.

I'll get back to you tomorrow morning.

Regards,


PickleRick
SplunkTrust

Yep.

Check the output of

splunk btool server list clustering | grep factor

 


veryfoot
Path Finder

 

splunk btool server list clustering | grep factor

 

Hi, thanks. Here is the output:

etc/system/default/server.conf >>> ack_factor = 0

etc/apps/MULTI_SITE_APP/local/server.conf >>> replication_factor = 2

etc/apps/MULTI_SITE_APP/local/server.conf >>> search_factor = 2

etc/apps/MULTI_SITE_APP/local/server.conf >>> site_replication_factor = origin:1,site1:1,site2:1,total:2

etc/apps/MULTI_SITE_APP/local/server.conf >>> site_search_factor = origin:1,site1:1,site2:1,total:2

etc/system/default/server.conf >>> replication_factor = 3
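To relate these values to the "Missing={ site2:1 }" message, here is a small sketch of how a per-site requirement can be read out of a site factor string. It rests on one assumption taken from my reading of the multisite docs: a site that is both the origin site and an explicitly listed site uses the higher of the two values. The function name required_copies is hypothetical, and the sketch ignores how the remaining "total" copies are distributed.

```shell
# Hypothetical helper: minimum copies a site must hold for one bucket.
# Usage: required_copies <site_factor_string> <site> <origin_site>
required_copies() (
  set -f                                    # no globbing while splitting
  factor=$(printf '%s' "$1" | tr -d ' ')    # tolerate spaces after commas
  site=$2
  origin=$3
  explicit=0
  origin_n=0
  IFS=','
  for term in $factor; do
    key=${term%%:*}
    val=${term#*:}
    case $key in
      "$site") explicit=$val ;;             # value listed for this site
      origin)  origin_n=$val ;;             # value for the originating site
    esac
  done
  # Assumption: when the site is the bucket's origin, the higher value applies.
  if [ "$site" = "$origin" ] && [ "$origin_n" -gt "$explicit" ]; then
    echo "$origin_n"
  else
    echo "$explicit"
  fi
)
```

With the settings above, a bucket that originated on site1 still needs one copy on site2, which matches the pending task complaining about site2.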

Regards,


PickleRick
SplunkTrust

OK. Looks relatively good.

Try to run

| rest splunk_server=<your_cluster_manager> /services/cluster/manager/peers
| table label site status

from your MC


veryfoot
Path Finder

I ran the command from the search bar of my Monitoring Console.

All 17 of my indexers are "Up", with the right site distribution.

I don't see the decommissioned indexer, which no longer has splunkd running (splunkd has been disabled).

Thanks


PickleRick
SplunkTrust

If your cluster has been recently migrated from single site to multisite there might be issues with "dangling" non-multisite buckets especially if you have constrain_singlesite_buckets=true.

Restarting CM _might_ resolve your issue but doesn't have to.

In case it doesn't it's probably a case for support.


veryfoot
Path Finder

Hi, and thanks for the support.

>>> If your cluster has been recently migrated from single site to multisite: NO

>>> Restarting CM _might_ resolve your issue: already done (CM restart and rolling restart, with no effect on RF/SF)

The operation was to add new indexers and then decommission the old ones.

It has been a multisite cluster since it was first built.

But I have "constrain_singlesite_buckets=true" on the CM and on the indexers, in etc/system/default/server.conf.

Maybe these are buckets from the very beginning of my infrastructure, from before the multisite cluster was built and operational?

Do you know the impact of changing constrain_singlesite_buckets to false?

Many thanks


PickleRick
SplunkTrust

I haven't personally done it, but these docs describe migrating buckets to multisite.

https://docs.splunk.com/Documentation/Splunk/9.1.2/Indexer/Migratetomultisite#How_the_cluster_migrat...


veryfoot
Path Finder

Thanks,

I was reading the same page ^^

I'll keep you updated. I just want to verify before pushing the modification to the CM's server.conf (only) and restarting the CM daemon:

[clustering]
mode = manager
constrain_singlesite_buckets = false

Do you know how to do the following?

To see how many buckets will require conversion to multisite, use

services/cluster/manager/buckets?filter=multisite_bucket=false&filter=standalone=false

before changing the manager node configuration.

Thanks


PickleRick
SplunkTrust

You can either use the "splunk _internal call" command on the cmdline or use

| rest /services/cluster/manager/buckets
| where multisite_bucket=0 AND standalone=0

(or "false" instead of 0, I'm not sure here)
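If running a search is not an option, the same endpoint can in principle be queried directly on the manager's management port. The sketch below only builds the URL, assuming the default management port 8089 and the standard output_mode and count REST parameters; the function name and the host cm.example.com are placeholders, not values from this thread.

```shell
# Hypothetical helper: build the URL for listing clustered buckets that are
# not yet multisite, using the filters quoted from the docs.
# Usage: buckets_needing_conversion_url <manager_host> [mgmt_port]
buckets_needing_conversion_url() {
  # Port defaults to 8089 (assumed management port); count=0 asks for all rows.
  printf 'https://%s:%s/services/cluster/manager/buckets?filter=multisite_bucket=false&filter=standalone=false&output_mode=json&count=0\n' \
    "$1" "${2:-8089}"
}
```

It could then be used along the lines of curl -sk -u admin "$(buckets_needing_conversion_url cm.example.com)", where -k skips certificate verification and should only be used if that is acceptable in your environment.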


veryfoot
Path Finder

The modification has been made on the CM.

I added the following to etc/system/local/server.conf (and restarted the CM):

"constrain_singlesite_buckets = false"

No change.

I performed another rolling restart.

No change.

I still have pending tasks that cannot be resynchronized.

Any suggestions for finding where the problem comes from?

Many thanks


veryfoot
Path Finder

Thanks for the reply.

The rest command is not working for me... I get error messages in the CM search for all my peer indexers, like:

[Peer1,Peer2,....] HTTP status not OK, Code=503, Service Unavailable

The web part is stopped on all my indexers... is there any other way to get this info?

Thanks


PickleRick
SplunkTrust

Sorry, I should have mentioned what was pretty obvious to me: the rest command should be run on the MC, since a properly configured MC should have access to all your components. And the rest call should be made from the MC _against_ your CM.

But still if you're stuck in that "no candidates" state, I'd suggest opening a support case.

veryfoot
Path Finder

Hi,

| rest /services/cluster/manager/buckets
| where multisite_bucket=0 AND standalone=0

Running this from the MC gives me the same error messages.

Port 8089 has been tested from the MC to the indexer(s) and is open.

I think it is because the web service is off on my indexers... but I have no time to dig into this right now.

The priority is to get SF / RF back to normal (green).

One thing I have noticed: when I try to resync one of these 3 pending tasks, the drop-down menu for choosing where to resync the bucket lists only indexers from the bucket's origin site, plus a single indexer from the other site. So I can't force replication to another indexer on the other site...

The details of the pending task give me this info for the bucket:

----

Replication count by site :

site 1:1

site 2:7

Search count by site

site 1:1

---

Is there a way to force replication to a specific indexer on the other site? (Splunk lists only one indexer on the other site, and always the same one.)

Thanks !


veryfoot
Path Finder

Thanks for all your help and advice.

I will try the rest command on the MC tomorrow, as you suggest; I'm back home now. Normally the MC is well configured.

I will post an update after looking into it.

But I agree with you, I think I will have to open a case with Splunk 😞

Best regards


