I have a cluster of 3 indexers with a bunch of indexes. Yesterday I had issues after service restarts on the cluster master. After the peers rejoined the cluster and completed replication to meet the search and replication factors, one of the indexes still shows 0 searchable copies. Replicated copies is 2, as it should be, but the searchable state is set to No.
It seems that something got stuck and searchable copies are not replicating.
Does anybody know how to troubleshoot this, or how to force replication for this particular index?
I am running 5.0.7.
so Peer_URI would be URI for Cluster master, right?
would this re-add cause any outage?
Looks like that endpoint was added in 5.0.2, so it should work in 5.0.7.
Did this problem show up after you restarted the master? In 5.0.x, the cluster master did not track frozen buckets properly. After a restart, the master would then proceed to "fix up" buckets that were previously frozen. See here for more info: http://docs.splunk.com/Documentation/Splunk/5.0.4/Indexer/Upgradeacluster#Why_the_safe_restart_clust.... The UI in 5.0.x also always reported the worst case: i.e., if there was even one bucket with no searchable copy, it would report that index as having no searchable copies. The problem might be caused by just a subset of the buckets, such as these frozen ones.
When you restart the master, a procedure similar to the one mentioned in that upgrade link is needed. I wonder if this could be the problem in your case. Since you've already restarted the master, you can no longer use that script as-is, because the information is already lost. But we might still be able to recover by (1) just giving the cluster enough time, or (2) using search to figure out the list of buckets to be fixed and then scripting it from there.
Try this search on the master from the CLI to get a list of frozen buckets:
$SPLUNK_HOME/bin/splunk search 'index=_internal component=CMMaster "remove bucket" frozen=true | dedup bid | table bid' -preview 0 > /var/tmp/frozen_buckets
and maybe also:
grep myindex /var/tmp/frozen_buckets | wc -l
to see how many such buckets show up for that index. That will tell us whether this is the problem.
I ran your script and parsed for my index, and got 3 bucket IDs. I had a bunch of buckets for other indexes, but those aren't being flagged in the cluster master dashboard.
Can you try this from the CLI:
for i in `cat /var/tmp/frozen_buckets`; do curl -d "" -k -u admin:changeme https://localhost:8089/services/cluster/master/buckets/$i/freeze; done
Then commit a new generation:
$SPLUNK_HOME/bin/splunk _internal call /cluster/master/control/directive/commitgeneration -method POST
You can use a file with just the bucket-ids from the index you care about or just do everything.
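To build that per-index file, you can grep on the index name as a prefix, since clustered bucket IDs take the form <indexname>~<localid>~<originguid>. A minimal sketch, using `myindex` as a placeholder for your index name:

```shell
# Keep only the bucket IDs belonging to one index. Bucket IDs look like
# <indexname>~<localid>~<originguid>, so the index name is a prefix.
# "myindex" below is a placeholder -- substitute your actual index name.
grep '^myindex~' /var/tmp/frozen_buckets > /var/tmp/frozen_myindex
```

Then point the for loop above at /var/tmp/frozen_myindex instead of /var/tmp/frozen_buckets.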
Can you let me know how that goes?
Also, if it is just 3 buckets, you can probably let it be for a bit and it should fix itself up.