How to troubleshoot why an index does not have any searchable copies in the cluster master dashboard?

zislin
Explorer

I have a cluster with 3 indexers and a bunch of indexes. Yesterday I had issues after service restarts on the cluster master. After the peers rejoined the cluster and completed replication to meet the search and replication factors, one of the indexes still shows 0 searchable copies. Replicated copies is 2, as it should be, but the searchable state is set to No.

It seems that something got stuck and the searchable copies are not replicating.

Does anybody know how to troubleshoot this, or how to force replication for this particular index?

Thank you.

dxu_splunk
Splunk Employee

What version are you on?

Try a re-add on your 3 peers:

curl -k -u USER:PASSWD https://PEER_URI:PEER_MGMT_PORT/services/cluster/slave/control/control/re-add-peer -X POST
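
For example, against a single peer (idx1.example.com and port 8089 below are placeholder values; substitute each peer's own management URI, one peer at a time):

# re-add one peer; repeat against each peer in turn
curl -k -u USER:PASSWD https://idx1.example.com:8089/services/cluster/slave/control/control/re-add-peer -X POST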

zislin
Explorer

I am running 5.0.7.

So PEER_URI would be the URI for the cluster master, right?

Would this re-add cause any outage?

Thanks.

svasan_splunk
Splunk Employee

zislin,

Did this problem show up after you restarted the master? In 5.0.x, the cluster master did not track frozen buckets properly. After a restart, the master would proceed to fix up buckets that were previously frozen. See here for more info: http://docs.splunk.com/Documentation/Splunk/5.0.4/Indexer/Upgradeacluster#Why_the_safe_restart_clust.... The UI in 5.0.x also always reported the worst case: i.e., if there was even one bucket with no searchable copy, it would report that index as having no searchable copies. The problem might be caused by just a subset of the buckets, such as these frozen buckets.

When you restart the master, a procedure similar to the one mentioned in the link for upgrades is needed. I wonder if this could be the problem in your case. Since you've already restarted the master, you can no longer use that script as-is, because the information is already lost. But we might still be able to recover by (1) just giving the cluster enough time, or (2) using search to figure out the list of buckets to be fixed and then scripting it from there.

Try this search on the master from the CLI to get a list of frozen buckets:

$SPLUNK_HOME/bin/splunk search 'index=_internal component=CMMaster "remove bucket" frozen=true | dedup bid | table bid' -preview 0 > /var/tmp/frozen_buckets

and maybe also:

grep my_index /var/tmp/frozen_buckets | wc -l

to see how many such buckets show up for that index. That will tell us if this is the problem.
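
For reference, each line in /var/tmp/frozen_buckets should be a bucket ID of the form index~bucket-number~originating-peer-GUID, something like this (illustrative values only):

my_index~515~EE19BBFD-2AAF-40D4-8814-FDA12B92A041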

zislin
Explorer

Svasan_splunk,

I ran your search, parsed the output for my index, and got 3 bucket IDs. I had a bunch of buckets for other indexes, but those aren't complaining in the cluster master dashboard.

What's next?

Thanks.

svasan_splunk
Splunk Employee

Can you try this from the CLI:

for i in `cat /var/tmp/frozen_buckets`; do curl -d "" -k -u admin:changeme https://localhost:8089/services/cluster/master/buckets/$i/freeze; done

and then:
$SPLUNK_HOME/bin/splunk _internal call /cluster/master/control/directive/commit_generation -method POST

You can use a file with just the bucket IDs from the index you care about, or just do everything.
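
Putting those two steps together for a single index, a sketch (the index name my_index and the admin:changeme credentials are placeholders):

# keep only the previously frozen buckets belonging to the index in question
grep '^my_index~' /var/tmp/frozen_buckets > /var/tmp/frozen_my_index
# mark each of those buckets as frozen on the master...
for i in `cat /var/tmp/frozen_my_index`; do curl -d "" -k -u admin:changeme https://localhost:8089/services/cluster/master/buckets/$i/freeze; done
# ...then commit a new generation
$SPLUNK_HOME/bin/splunk _internal call /cluster/master/control/directive/commit_generation -method POST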

Can you let me know how that goes?

zislin
Explorer

It didn't work. I tried running the command with just one bucket ID and got this result:

<msg type="ERROR">In handler 'clustermasterbuckets': failed on freeze bucket request bid=cisco~515~EE19BBFD-2AAF-40D4-8814-FDA12B92A041 err='Unknown bucket bid=cisco~515~EE19BBFD-2AAF-40D4-8814-FDA12B92A041'</msg>

I ran your first command to identify these buckets, and this one still shows up.

Thanks.

svasan_splunk
Splunk Employee

This is okay. It just means that the bucket has been frozen on all nodes and removed completely from the cluster, so the cluster master no longer knows about the bucket, which is good for our purposes. The problem happens when the bucket is frozen on only a subset of the peers and the master is then restarted: it forgets that the bucket was frozen and proceeds to fix it up, since at least one valid copy of it exists.
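
(If you want to double-check a given bucket, you can query the master directly; a sketch, reusing the placeholder admin:changeme credentials from above, and I haven't verified the exact response format on 5.0.x:

# an error response for this bid means the master has no record of it at all
curl -k -u admin:changeme https://localhost:8089/services/cluster/master/buckets/cisco~515~EE19BBFD-2AAF-40D4-8814-FDA12B92A041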

Things I would suggest:

  • Run the freeze command above for all the previously frozen buckets in that one index. (Though if it has only 3 buckets and it hasn't fixed itself up yet, this might not be the problem. But we should probably try it in any case.)
  • Commit the next generation as indicated above.
  • Check the UI to see how things stand.
  • If that doesn't work, re-add all the peers using the command dxu_splunk mentioned. You need to do it for all 3 indexers; do it one at a time, and wait for the master dashboard UI to show the peer as Up before doing the next one (see the sketch below). Sometimes a state mismatch between the master and the peers can cause this problem, and the re-add would fix some of that.
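
If you prefer the CLI to the dashboard for checking peer state between re-adds, something like this on the master should show each peer's status (a sketch; I haven't verified the output format on 5.0.x):

# on the master: list peers and their status; wait for the re-added peer
# to show as Up before re-adding the next one
$SPLUNK_HOME/bin/splunk list cluster-peers -auth admin:changeme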

Let's see where things stand after the above. If we still have a problem, we might need to figure out which bucket is causing it, and so on. There are troubleshooting endpoints to check on 6.0.x and above, but I'm not sure about 5.0.x; we might have to do it differently there. I'll set up a 5.0.x cluster and take a look.

zislin
Explorer

Ok, I got the command to work. There is a space between splunk and _internal 🙂
But the problem didn't go away.

zislin
Explorer

I went through these 3 buckets, and on the last one it did something.
Now I have an issue with committing the next generation. The command that you've provided doesn't work. A splunk_internal binary doesn't exist.

svasan_splunk
Splunk Employee

Also, if it is just 3 buckets, you can probably let it be for a bit, and it should fix itself up.

dxu_splunk
Splunk Employee

Ah, I'm not sure about 5.0.7.

svasan_splunk
Splunk Employee

Looks like that endpoint was added in 5.0.2, so it should work in 5.0.7.

zislin
Explorer

So I have three peers; are you saying I need to re-add all three of them?
