Hi all, I'm on Splunk version 9.0.2.
After decommissioning one indexer in a multisite cluster, I can't get my SF / RF back to met.
A rolling restart and a restart of the CM (splunkd) had no effect.
I have 3 SF fixup tasks pending with the same message:
Missing enough suitable candidates to create a replicated copy in order to meet replication policy. Missing={ site2:1 }
I have tried resyncing and rolling the bucket, with no success.
In the details of the pending task, I can see that the bucket exists on only one indexer and is not searchable on the other indexers of the cluster.
My SF = 2 and RF = 2.
I'd like to be in a clean state before decommissioning the next indexer.
Any advice or help to get my SF/RF back would be highly appreciated (this is a production issue).
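In case it helps, the pending fixup list can also be pulled with a REST search pointed at the CM. This is just a sketch: I'm assuming the 9.x "manager" REST path (the older cluster/master/fixup path should behave the same), and level can also be replication_factor; I'm not sure of the exact field names it returns, so I just eyeball the raw output:
| rest splunk_server=<your_cluster_manager> /services/cluster/manager/fixup level=search_factor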
Thanks in advance.
And what are your site RF/SF settings and how many indexers do you have in each site?
Hi and thanks for the reply.
And what are your site RF/SF > can you be more specific please? In the server.conf on my CM? (I will check that when I'm back at work tomorrow.)
For the site details:
2 sites with 18 indexers in total: 9 on one site, and 8 plus 1 decommissioned on the other site.
I'll get back to you tomorrow morning.
Regards,
Yep.
Check the output of
splunk btool server list clustering | grep factor
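If you add --debug, btool also shows which file each value comes from, which helps spot whether an app or system/local is overriding the defaults:
splunk btool server list clustering --debug | grep -i factor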
Hi, thanks, here is the output:
etc/system/default/server.conf >>> ack_factor = 0
etc/apps/MULTI_SITE_APP/local/server.conf >>> replication_factor = 2
etc/apps/MULTI_SITE_APP/local/server.conf >>> search_factor = 2
etc/apps/MULTI_SITE_APP/local/server.conf >>> site_replication_factor = origin:1, site1:1, site2:1, total:2
etc/apps/MULTI_SITE_APP/local/server.conf >>> site_search_factor = origin:1, site1:1, site2:1, total:2
etc/system/default/server.conf >>> replication_factor = 3
Regards,
OK. Looks relatively good.
Try to run
| rest splunk_server=<your_cluster_manager> /services/cluster/manager/peers
| table label site status
from your MC
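If the CLI is easier for you, running this directly on the CM should list roughly the same peer and site information (just a pointer, I don't remember the exact columns off the top of my head):
splunk list cluster-peers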
I ran the command from the search bar of my Monitoring Console.
All my 17 indexers show "Up" with the right site distribution.
I don't see the decommissioned indexer, which no longer has splunkd running (splunkd has been disabled).
Thanks
If your cluster has been recently migrated from single-site to multisite, there might be issues with "dangling" non-multisite buckets, especially if you have constrain_singlesite_buckets=true.
Restarting the CM _might_ resolve your issue, but it doesn't have to.
If it doesn't, it's probably a case for support.
Hi and thanks for the support.
>>> If your cluster has been recently migrated from single site to multisite : NO
>>> Restarting CM _might_ resolve your issue: already done (CM only, plus a rolling restart, with no effect on RF/SF).
The operation was to add new indexers and then decommission the old ones.
It has been a multisite cluster since it was first built.
But I do have "constrain_singlesite_buckets = true" on the CM and on the indexers, in etc/system/default/server.conf.
Maybe these are buckets from the very beginning of my infrastructure, from a time when the multisite cluster was not yet built and operational?
Do you know the impact of changing constrain_singlesite_buckets to false?
Many thanks
I haven't personally done it, but these docs describe migrating buckets to multisite.
Thanks,
I was reading the same page ^^
I'll keep you updated. I just want to verify before pushing the modification to the CM's server.conf (only) + restarting the CM daemon:
[clustering]
mode = manager
constrain_singlesite_buckets = false
Do you know how to perform this:
To see how many buckets will require conversion to multisite, use
services/cluster/manager/buckets?filter=multisite_bucket=false&filter=standalone=false
before changing the manager node configuration.
Thanks
You can either use the "splunk _internal call" command on the command line or use
| rest /services/cluster/manager/buckets
| where multisite_bucket=0 AND standalone=0
(or "false" instead of 0, I'm not sure here)
The modification has been made on the CM in etc/system/local/server.conf (+ restart of the CM):
"constrain_singlesite_buckets = false"
No change.
Performed another rolling restart.
No change.
I still have pending jobs that are impossible to resynchronize.
Any suggestion on how to find where the problem comes from?
Many thanks
Thanks for the reply.
The rest command is not working for me... I get error messages in the search on the CM for all my peer indexers, like:
[Peer1,Peer2,....] HTTP status not OK, Code=503, Service Unavailable
The web interface is stopped on all my indexers... is there any other way to get this info?
Thanks
Sorry, I should have mentioned what was pretty obvious to me: the rest command should be run from the MC - a properly configured MC has access to all your components. But you should point the rest call from the MC _against_ your CM.
Still, if you're stuck in that "no candidates" state, I'd suggest opening a support case.
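In other words, something like this (the same search as before, just with the target explicitly set to your CM):
| rest splunk_server=<your_cluster_manager> /services/cluster/manager/buckets
| where multisite_bucket=0 AND standalone=0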
Hi,
| rest /services/cluster/manager/buckets
| where multisite_bucket=0 AND standalone=0
From the MC it gives me the same error messages.
Port 8089 has been tested between the MC and the indexer(s) and is open.
I think it is because the web service is off on my indexers... but no time to dig into this right now.
The priority is to get SF / RF back to green.
One thing I have noticed: when I try to resync one of these 3 pending jobs, the drop-down menu to choose where to resync the bucket only lists indexers from the bucket's origin site, and only one indexer from the other site. So I can't force replication to another indexer on the other site...
The "View Bucket Details" view for the pending task gives me this info for the bucket:
----
Replication count by site:
site1: 1
site2: 7
Search count by site:
site1: 1
----
Is there a way to force replication to a specific indexer on the other site? (Splunk only ever lists the same single indexer on the other site.)
Thanks !
Thanks for all your help and advice.
I will try the rest command on the MC as you suggest tomorrow; I'm back home now. Normally the MC is correctly configured.
I will post an update after investigating.
But I agree with you, I think I will have to open a case with Splunk 😞
Best regards
Hi Splunkers,
The origin of the problem was corrupted buckets.
In my case 3 buckets were corrupted. This is what happens when analysts push bad search requests that kill the splunkd daemon on indexers that are up and running, while another indexer is being decommissioned.
I used the following command (on the indexer where the bucket lives; that indexer has to be stopped too):
>>> splunk fsck repair [bucket_path] [index]
(use a "find /indexes/path | grep bucket_uid$ | grep [index's bucket]" to find his path)
That fsck confirm the problem.
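For reference, the rough sequence I followed looked like this (paths and the bucket ID are placeholders, and the fsck options should be double-checked against the fsck usage output on your version):
# stop splunkd on the indexer that holds the corrupted copy
splunk stop
# locate the bucket directory from its bucket ID
find /opt/splunk/var/lib/splunk -type d -name "*<bucket_id>*"
# try to repair just that one bucket
splunk fsck repair --one-bucket --bucket-path=<path_to_bucket_directory>
# start the indexer again
splunk start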
In my case, the problem was not repairable.
The data were old and very small, so the decision was made to delete these buckets.
After that everything went back to normal.
Problem solved.
Thanks for the help