I have a cluster with 3 indexers and a bunch of indexes. Yesterday I had issues after service restarts on the cluster master. After the peers rejoined the cluster and completed replication to meet the search and replication factors, one of the indexes still shows 0 searchable copies. Replicated copies is 2, as it should be, but the searchable state is set to No.
It seems that something got stuck and searchable copies are not replicating.
Does anybody know how to troubleshoot this, or how to force replication for this particular index?
Thank you.
What version are you on?
Try a re-add on your 3 peers:
curl -k -u USER:PASSWD https://PEER_URI:PEER_MGMT_PORT/services/cluster/slave/control/control/re-add-peer -X POST
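To be clear, the re-add call goes to each peer's own management port, one call per peer. A loose sketch of looping over all three (the peer hostnames and port 8089 below are placeholders, not from the thread; the curl line is commented out so this is a dry run that only prints the URLs):

```shell
# Sketch only: peer hostnames and the management port are placeholders.
# The re-add endpoint is hit on each PEER's own management port.
for PEER in peer1.example.com peer2.example.com peer3.example.com; do
  URL="https://${PEER}:8089/services/cluster/slave/control/control/re-add-peer"
  echo "$URL"                                   # dry run: print the URL
  # curl -k -u USER:PASSWD "$URL" -X POST       # uncomment to actually re-add
done
```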
I am running 5.0.7.
So PEER_URI would be the URI for the cluster master, right?
Would this re-add cause any outage?
Thanks.
zislin,
Did this problem show up after you restarted the master? In 5.0.x, the cluster master did not track frozen buckets properly, so after a restart it would proceed to "fix up" buckets that had previously been frozen. See here for more info: http://docs.splunk.com/Documentation/Splunk/5.0.4/Indexer/Upgradeacluster#Why_the_safe_restart_clust.... The UI in 5.0.x also always reported the worst case: if even one bucket had no searchable copy, it would report the whole index as having no searchable copies. So the problem might be caused by just a subset of the buckets, such as these frozen ones.
When you restart the master, a procedure similar to the one described in that link for upgrades is needed, and I wonder if that is the problem in your case. Since you've already restarted the master, that information is lost and you can't use the script as-is anymore. But we may still be able to recover by (1) giving the cluster enough time, or (2) using search to figure out the list of buckets that need fixing and scripting it from there.
Try this search on the master from the CLI to get a list of frozen buckets:
$SPLUNK_HOME/bin/splunk search 'index=_internal component=CMMaster "remove bucket" frozen=true | dedup bid | table bid' -preview 0 > /var/tmp/frozen_buckets
and maybe also:
grep my_index /var/tmp/frozen_buckets | wc -l
to see how many such buckets show up for that index. That will tell us whether this is the problem.
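Since the bucket ids in that file have the form index~localid~guid (as seen later in this thread), the index name is everything before the first `~`, so you can also count frozen buckets per index in one pass. An illustrative sketch against a made-up sample file (in practice, point the pipeline at /var/tmp/frozen_buckets):

```shell
# Illustrative sketch: bucket ids have the form index~localid~guid, so the
# index name is the field before the first '~'. The sample file below is
# made up; in practice use /var/tmp/frozen_buckets instead.
cat > /tmp/frozen_buckets.sample <<'EOF'
cisco~515~EE19BBFD-2AAF-40D4-8814-FDA12B92A041
cisco~516~EE19BBFD-2AAF-40D4-8814-FDA12B92A041
main~12~00000000-0000-0000-0000-000000000000
EOF
# Count frozen buckets per index:
cut -d'~' -f1 /tmp/frozen_buckets.sample | sort | uniq -c
```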
Svasan_splunk,
I ran your script, parsed for my index, and got 3 bucket IDs. I had a bunch of buckets for other indexes, but those aren't complaining in the cluster master dashboard.
What's next?
Thanks.
Can you try this from the CLI (note the backticks around the cat command):

for i in `cat /var/tmp/frozen_buckets`; do curl -d "" -k -u admin:changeme https://localhost:8089/services/cluster/master/buckets/$i/freeze; done
and then:
$SPLUNK_HOME/bin/splunk _internal call /cluster/master/control/directive/commit_generation -method POST
You can use a file with just the bucket ids from the index you care about, or just do everything.
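The per-index variant can be sketched as below: grep the list down to one index's buckets, then loop over the smaller file. This is a dry run against a made-up sample file ('cisco' is the index name seen later in this thread; substitute yours), with the real curl call from above left commented out:

```shell
# Sketch: keep only one index's buckets, then loop over that smaller file.
# Dry run: echo instead of the real curl call, and a made-up sample file.
cat > /tmp/frozen_buckets.sample <<'EOF'
cisco~515~EE19BBFD-2AAF-40D4-8814-FDA12B92A041
main~12~00000000-0000-0000-0000-000000000000
EOF
grep '^cisco~' /tmp/frozen_buckets.sample > /tmp/frozen_cisco
for i in `cat /tmp/frozen_cisco`; do
  echo "would freeze $i"
  # curl -d "" -k -u admin:changeme "https://localhost:8089/services/cluster/master/buckets/$i/freeze"
done
```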
Can you let me know how that goes?
It didn't work. I tried running the command with just one bucket ID and got this result:
<msg type="ERROR">In handler 'clustermasterbuckets': failed on freeze bucket request bid=cisco~515~EE19BBFD-2AAF-40D4-8814-FDA12B92A041 err='Unknown bucket bid=cisco~515~EE19BBFD-2AAF-40D4-8814-FDA12B92A041'</msg>
I ran your first command to identify these buckets, and this one still shows up.
Thanks
This is okay. It just means that the bucket has been frozen on all nodes and removed completely from the cluster, so the cluster master no longer knows about it, which is good for our purposes. The problem happens when a bucket is frozen on only a subset of the peers and the master is then restarted: the master forgets that the bucket was frozen and proceeds to fix it up, since at least one valid copy still exists.
Things I would suggest:
Let's see where things stand after the above. If we still have a problem, we'll need to figure out which bucket is causing it. There are troubleshooting endpoints for this on 6.0.x and above, but I'm not sure about 5.0.x; we might have to do it differently there. I'll set up a 5.0.x cluster and take a look.
Ok, I got the command to work. There is a space between splunk and _internal 🙂
But the problem didn't go away.
I went through these 3 buckets, and on the last one it did something.
Now I have an issue with committing the next generation. The command you provided doesn't work; a splunk_internal binary doesn't exist.
Also, if it's just 3 buckets, you can probably let it be for a bit and it should fix itself up.
Ah, I'm not sure about 5.0.7.
Looks like that endpoint was added in 5.0.2, so it should work in 5.0.7.
So I have three peers. Are you saying I need to re-add all three of them?