Solved: Re: why is the cluster master not able to fixup bu...

rphillips_splk · ‎01-11-2019

Problem:
My cluster master is reporting fixup tasks under Bucket Status > Generation tab with the status "cannot fix up search factor as bucket is not serviceable", but these buckets never get fixed.

rphillips_splk · ‎01-11-2019

When the | delete command is issued in a search, the data isn't actually deleted from disk. Instead, Splunk creates a "deletes" directory under the bucket's rawdata directory and stops returning those events in searches.
For example, on an indexer:
$SPLUNK_HOME/var/lib/splunk/defaultdb/db/db_1546932848_1546891149_15_4720CDA9-5F9B-4CE1-BB0D-10A6F555A1E4/rawdata/deletes
[root@indexer01 deletes]# zcat 38602ccf63e998fa1823f9f664055448.csv.gz
timestamp,event_address,type_id,host_id,source_id,sourcetype_id
1546932848,1846,0,2,1,1
1546932848,1844,0,2,1,1
1546932848,1842,0,2,1,1
1546932848,1840,0,2,1,1
1546932848,1838,0,2,1,1
1546932848,1836,0,2,1,1
1546932848,1834,0,2,1,1
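
To get a quick sense of how many events a bucket's delete markers cover, a file like the one above can be read with a few lines of Python. This is only a minimal sketch, assuming the column layout shown in the zcat output; the function name `count_delete_markers` is hypothetical and not part of Splunk.

```python
import csv
import gzip
from collections import Counter


def count_delete_markers(csv_gz_path):
    """Count rows per timestamp in a gzip-compressed deletes CSV.

    Assumes the header seen in the zcat output:
    timestamp,event_address,type_id,host_id,source_id,sourcetype_id
    """
    counts = Counter()
    # Open in text mode so csv can read the decompressed lines directly.
    with gzip.open(csv_gz_path, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            counts[row["timestamp"]] += 1
    return counts
```

Run against the file shown above, this would report seven rows for timestamp 1546932848. Comparing the counts for the same bucket on each peer is one rough way to spot a "deletes" directory that is out of sync.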

First, the primary copy of the bucket gets the "deletes" directory.

All peers that hold a copy of this bucket need to have the "deletes" directory in sync.

The peer holding the primary bucket updates its checksum and updates the cluster master.

Subsequently, that peer initiates a peer-to-peer sync request to update the other peers holding this bucket. This sync happens over port 8089 between peers.

If port 8089 is not open between indexers, the sync request between peers fails and the buckets are left in a fixup loop that never completes.
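
A quick way to check whether the management port is reachable is to attempt a plain TCP connection from one indexer to another. The sketch below is an assumption-laden illustration, not a Splunk tool: the helper name `can_reach` is hypothetical, and a successful TCP connect only shows that the port is open, not that splunkd is answering sync requests correctly.

```python
import socket


def can_reach(host, port=8089, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it on each indexer against every other indexer in the cluster, for example `can_reach("10.10.10.1")`, substituting your own peer addresses. A `False` result points to a firewall or routing problem on the management port.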

This shows up in the CM fixup tasks on the Generation tab as "cannot fix up search factor as bucket is not serviceable".

If you see a log message like the one below in splunkd.log on the indexer, port 8089 (the default Splunk management port) is most likely not open between indexers, and it needs to be:

01-08-2019 16:15:57.292 -0800 ERROR CMRepJob - job=CMSyncP2PJob bid= my_guid= my_rawport=9887 my_usessl=0 ot_guid= ot_hp=10.10.10.1:8089 ot_rawport=9887 ot_usessl=0 relative_path= custact=p2p_syncup getHttpReply failed; err: Connect Timeout
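
To find out which peers the failed sync requests were aimed at, lines like the one above can be pulled out of splunkd.log with a regular expression. This is a rough sketch that assumes the field layout of the single log line shown here; the pattern and the function name `failed_sync_targets` are hypothetical, and other Splunk versions may format the message differently.

```python
import re

# Matches the fields of interest in a CMSyncP2PJob error line.
P2P_ERROR = re.compile(
    r"ERROR CMRepJob - job=CMSyncP2PJob\b.*?"
    r"\bot_hp=(?P<host>[^\s:]+):(?P<port>\d+)\b.*?"
    r"err: (?P<err>.+)$"
)


def failed_sync_targets(lines):
    """Yield (host, port, error) for each failed peer-to-peer sync line."""
    for line in lines:
        match = P2P_ERROR.search(line.rstrip("\n"))
        if match:
            yield match["host"], int(match["port"]), match["err"].strip()
```

Passing an open splunkd.log file handle to `failed_sync_targets` yields one tuple per matching line, which gives a list of the peer addresses and ports to check.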

Once that port is opened, the fixup tasks should complete and be removed from the CM fixup activities.


rphillips_splk · ‎01-11-2019

splunkd.log shows: ERROR CMRepJob - job=CMSyncP2PJob

jkat54 · ‎01-11-2019

I’ve seen this before when frozen buckets were restored to just one of two indexers in their cluster.

Buckets in the thaweddb path are “not serviceable” because by placing them in thawed you’re telling Splunk you don’t want them to be deleted. Splunk also won’t replicate thawed buckets, because that would be a mess, so thawed buckets will also show as not serviceable.

I mention this because the solution for not-serviceable thawed buckets would be different from the solution that worked above, in case someone arrives with a very similar issue but a different situation.
