Hi,
I have buckets in my pending fixup tasks on my cluster master that have status: "Cannot replicate as bucket hasn't rolled yet".
What confuses me is that they are all frozen and have already been deleted:
Fixup Reason
bid=XYZ removed from peer=XYZ, frozen=1
So where does a frozen bucket need to be replicated to? Or is this error misleading?
I am not able to roll the buckets from the Cluster Master UI.
I would be happy for any hint.
Thank you
David
Hi
We hit this after upgrading to 8.1.4 (I have heard that the same issue has also been found on 8.1.3). As you said, it seems to be raised when there are buckets whose primary or secondary copy has already been frozen and you then do, for example, a rolling restart. To me it looks like a bug. I will report it to Splunk support to have it verified and fixed in a future release.
You can temporarily clear this task list by restarting the CM, but that is only a temporary workaround. The tasks will be back sooner or later.
Here is a dashboard that I use to verify this issue.
<form>
<label>FS-SPL-Bucket-Info</label>
<fieldset submitButton="false">
<input type="time" token="timePicker">
<label>Timeperiod</label>
<default>
<earliest>-7d@h</earliest>
<latest>now</latest>
</default>
</input>
<input type="text" token="bucketGuid" searchWhenChanged="true">
<label>Bucket GUID</label>
</input>
<input type="text" token="bucketIdx">
<label>Index</label>
</input>
<input type="text" token="bucketNbr">
<label>Bucket #</label>
</input>
</fieldset>
<row>
<panel>
<title>Bucket replication status</title>
<table>
<search>
<query><![CDATA[| rest splunk_server=<ADD YOUR CM HERE> /services/cluster/master/buckets
| search title=$bucketIdx$~$bucketNbr$~$bucketGuid$*
| rex field=title "^(?<repl_index>[^\~]+)"
| search repl_index="*" standalone=0 frozen=*
| rename title AS bucketID
| fields bucketID peers.*.search_state *site*
| untable bucketID siteState value
| rex field=siteState "peers\.(?<search_state>[^\.]*?)\.search_state"
| rex field=siteState "\.(?<primaries_by_site>\S+)"
| rex field=siteState "\.(?<rep_count_by_site>\S+)"
| rex field=siteState "\.(?<search_count_by_site>\S+)"
| eval peerGUID=if(siteState=="primaries_by_site", value, peerGUID)
| eval site=if(siteState=="origin_site", value, site)
| eval value=if(siteState=="search_count_by_site", site + ":" + value, value)
| eval value=if(siteState=="rep_count_by_site", site + ":" + value, value)
| join type=outer peerGUID
[ rest splunk_server=<ADD YOUR CM HERE> /services/cluster/master/peers
| fields active_* host* label title status site
| eval PeerName= site + ":" + label + ":" + host_port_pair
| rename title AS peerGUID
| rename site AS peerSite
| table peerGUID PeerName peerSite ]
| eval site=if(siteState=="search_state", peerSite, site)
| eval value=if(siteState=="primaries_by_site", PeerName + ":For_" + site, value)
| eval value=if(siteState=="search_state", PeerName + ":" + value, value)
| fields - PeerName peerGUID peerSite
| chart limit=0 values(value) over bucketID by siteState]]></query>
<earliest>-24h@h</earliest>
<latest>now</latest>
<sampleRatio>1</sampleRatio>
</search>
<option name="count">20</option>
<option name="dataOverlayMode">none</option>
<option name="drilldown">none</option>
<option name="percentagesRow">false</option>
<option name="rowNumbers">false</option>
<option name="totalsRow">false</option>
<option name="wrap">true</option>
</table>
</panel>
</row>
<row>
<panel>
<title>Bucket lifetime</title>
<table>
<search>
<query>index=_internal sourcetype="splunkd" source="*/splunkd.log" (*_$bucketNbr$_$bucketGuid$ OR hot_v1_$bucketNbr$ OR $bucketIdx$~$bucketNbr$~$bucketGuid$) $bucketIdx$
| eval reason = coalesce(reason,Reason)
| table _time host log_level peer peer_name component bucketType event event_message search_status reason from to status path
| sort +_time</query>
<earliest>$timePicker.earliest$</earliest>
<latest>$timePicker.latest$</latest>
<sampleRatio>1</sampleRatio>
</search>
<option name="count">20</option>
<option name="dataOverlayMode">none</option>
<option name="drilldown">none</option>
<option name="percentagesRow">false</option>
<option name="refresh.display">progressbar</option>
<option name="rowNumbers">false</option>
<option name="totalsRow">false</option>
<option name="wrap">true</option>
</table>
</panel>
</row>
</form>
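The dashboard has three text inputs (Index, Bucket #, Bucket GUID), while the fixup list shows a single bucket ID. Going by the `title=$bucketIdx$~$bucketNbr$~$bucketGuid$` filter in the first search, the ID appears to be those three parts joined with `~`. A small sketch for splitting one into the form values, assuming that layout holds; the function name and the sample ID are made up for illustration:

```python
def split_bucket_id(bucket_id: str) -> dict:
    """Split 'index~number~guid' into the three dashboard inputs.

    Assumption: bucket IDs follow the index~number~guid layout that the
    dashboard's title filter uses. Values are kept as strings so they
    can be pasted into the form as-is.
    """
    # Split from the right so only the last two '~' act as separators.
    parts = bucket_id.strip().rsplit("~", 2)
    if len(parts) != 3 or not all(parts) or not parts[1].isdigit():
        raise ValueError(f"not an index~number~guid bucket ID: {bucket_id!r}")
    idx, nbr, guid = parts
    return {"bucketIdx": idx, "bucketNbr": nbr, "bucketGuid": guid}


if __name__ == "__main__":
    # Hypothetical ID, not taken from a real cluster.
    print(split_bucket_id("main~42~00000000-0000-0000-0000-000000000000"))
```

Anything that does not match the three-part layout raises a `ValueError` instead of being passed on to the form.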
I'm having the same problem, did you find the solution?
2021-06-01 | SPL-206510, SPL-213903, SPL-213901, SPL-213902 | CM issues fixup tasks for "frozen in cluster" clustered buckets
Workaround 1: Restart the CM.
Workaround 2: Use a REST search on the CM to show only the fixup tasks for non-frozen buckets. In versions lower than 8.2.0, the endpoint name is master, not manager.
| rest splunk_server=local /services/cluster/manager/fixup level=replication_factor
| table index title initial.reason latest.reason level
| append
[| rest splunk_server=local /services/cluster/manager/fixup level=search_factor
| table index title initial.reason latest.reason level ]
| join title
[| rest splunk_server=local /services/cluster/manager/buckets f=title f=frozen search="frozen=0"]
| stats values(level) AS level values(frozen) AS frozen values(initial.reason) AS initial.reason values(latest.reason) AS latest.reason BY index title
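If you have already exported the fixup and bucket lists (for example as JSON) and want to apply the same filter outside Splunk, the join in workaround 2 comes down to keeping only the fixup rows whose bucket is not frozen. A rough sketch under that assumption; the row layout (dicts with `title`, `frozen`, `level`) mirrors the field names in the search above, but the function name and sample rows are invented for illustration:

```python
def non_frozen_fixups(fixups, buckets):
    """Keep fixup rows whose bucket title belongs to a non-frozen bucket.

    Assumption: each row is a dict; bucket rows carry 'title' and
    'frozen' (0/1, "0"/"1" or a boolean). Buckets without a usable
    'frozen' value are treated as not matching, like the inner join
    in the SPL search.
    """
    live_titles = {
        b["title"]
        for b in buckets
        if str(b.get("frozen")).lower() in {"0", "false"}
    }
    return [f for f in fixups if f["title"] in live_titles]


if __name__ == "__main__":
    # Invented sample rows, only to show the shape of the data.
    fixups = [
        {"title": "main~1~AAA", "level": "replication_factor"},
        {"title": "main~2~BBB", "level": "search_factor"},
    ]
    buckets = [
        {"title": "main~1~AAA", "frozen": "1"},
        {"title": "main~2~BBB", "frozen": "0"},
    ]
    for row in non_frozen_fixups(fixups, buckets):
        print(row["title"], row["level"])
```

This only hides the noise from frozen buckets. It does not remove the fixup tasks from the CM.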