About richardgosnay

richardgosnay · ‎05-27-2021

I have now pushed new configs with updated maxTotalDataSizeMB and maxDataSizeMB for both home and cold path, still the Current Size is greater and shows no signs of reducing. I have two environments, one in Belgium and one in Australia, each are installed almost identically. The Australian one works perfectly, all indexes are kept below their max size settings, however the Belgium one refuses to budge. Am I missing something? Is there a setting I don't know about? Any suggest would be greatly appreciated. Thank you Here is an example of an index I've capped at 30Gb (10Gb for hot/warm and 20Gb for cold) [index-1] homePath = $SPLUNK_DB\index-1\db coldPath = $SPLUNK_DB\index-1\colddb thawedPath = $SPLUNK_DB\index-1\thaweddb repFactor = auto enableDataIntegrityControl = 0 enableTsidxReduction = 0 maxTotalDataSizeMB = 30720 homePath.maxDataSizeMB = 10240 coldPath.maxDataSizeMB = 20480 bucketRebuildMemoryHint = 0 compressRawdata = 1 enableOnlineBucketRepair = 1 minHotIdleSecsBeforeForceRoll = 0 rtRouterQueueSize = rtRouterThreads = suspendHotRollByDeleteQuery = 0 syncMeta = 1 disabled = 0

richardgosnay · ‎05-10-2021

Hi Splunk Folk, I've spent most of the morning trying to find this with no luck, I've seen some similar posts but none of the solutions work for me. Why is the "Current Size" greater than the "Max Size" for several indexes which reside on a cluster? Here is an example of the indexes.conf file that my master is pushing out for a 100Gb Max Size index. [index_name] homePath = $SPLUNK_DB\index_name\db coldPath = $SPLUNK_DB\index_name\colddb thawedPath = $SPLUNK_DB\index_name\thaweddb repFactor = auto enableDataIntegrityControl = 0 enableTsidxReduction = 0 maxTotalDataSizeMB = 102400 bucketRebuildMemoryHint = 0 compressRawdata = 1 enableOnlineBucketRepair = 1 minHotIdleSecsBeforeForceRoll = 0 suspendHotRollByDeleteQuery = 0 syncMeta = 1 disabled = 0 I haven't tried changing the maxTotalDataSizeMB value and pushing out new configs yet because I wanted to understand why it's doing this in the first place? Any ideas?

richardgosnay · ‎05-04-2021

Hey Splunk Friends, I currently have 32 indexes spread across 2 peers managed by 1 master. The total space for these indexes has now reached just under 3,000Gb (one of the indexes alone is 1,486Gb). We don't really have any performance issues at present, but when the Splunk machines get restarted for any reason, it does take some time for the Indexes to catch up (Replication Factor, Search, etc). On the odd occasion, if there has been an issue which lasted longer, it has caused us to see bucket issues. My question, is 32 indexes (3000Gb) too much for one cluster (two peers)? If so, should I create another cluster? Or add additional peers?

richardgosnay · ‎04-12-2021

Hi Guys, I have: 1 x Search Node 1 x Master Node - 2 x Peer Nodes 1 x Deployment Node I've updated the master_uri & pass4SymmKey in the Search node and restarted it Splunk via the GUI, this worked fine and the license page is showing the new values. However, I am a bit reluctant to just change/restart the other nodes for fear of any bucket/replication issues. Am I ok to update the Master Node the same and perform a normal restart? Then update the Peer Nodes and perform a Rolling Restart? If not, what it is the best way to apply the conf changes and apply them? I did try searching the documentation but I got a bit lost. Thank you in advance.

richardgosnay · ‎04-07-2021

That worked perfectly, thank you...

richardgosnay · ‎04-07-2021

Thank you greatly, I will be performing the peer restarts in around 7 hours time. I'll let you know if it works and upvote accordingly.

richardgosnay · ‎04-07-2021

The bucket ID is the same, but the range at the beginning is not. Example Bucket ID: _audit~110~25359C10-2544-436D-893A-657C950D7863 Peer 1 Folder Name: rb_1614134403_1612300619_110_25359C10-2544-436D-893A-657C950D7863 Peer 2 Folder Name: db_1614134587_1612300619_110_25359C10-2544-436D-893A-657C950D7863 All neighbouring folder names match perfectly, it's only the buckets in question that don't match. If I remove the RB folder, will it get re-created with the correct DB equivalent?

richardgosnay · ‎04-06-2021

It seems, the only buckets affected are the replicated ones (rb instead of db). If I manually removed these before restarting the Cluster Master (and peers) will they just be re-created?

richardgosnay · ‎04-06-2021

The only errors in Splunk are the same as the ones in the splunkd.log file, you can see the snippet in the original post.

richardgosnay · ‎04-06-2021

If I manually locate the buckets in question, they don't have inflight- in the filename, they appear as normal buckets. But every time I try to run a fix up task like Roll, Resync or Delete, the log files states it is in flight (see previous log snippet). Should I try running the fix up tasks in maintenance mode?

richardgosnay · ‎04-06-2021

Hi, I'm currently running Splunk 7.3.0 and have 32 indexes running in a single cluster with 2 peers. Indexes are being replicated across both peers. Everything was working fine until we experienced a network blip 12 days ago, now I've noticed that the Replication Factor is not being met because there are some buckets from this time period which don't match, an average of about 3 buckets. I've tried to Roll, Resync and Delete these buckets via the GUI but each step fails. When I check splunkd.log, it appears as if Splunk is automatically trying to recover from these Fix Up tasks but it keeps reporting that the bucket is still in flight so can't. 04-06-2021 08:07:39.618 +0100 INFO CMSlave - truncate request bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 current bid status=Complete 04-06-2021 08:07:39.618 +0100 INFO CMSlave - bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=Complete to=PendingDiscard for reason="schedule delete bucket" 04-06-2021 08:07:39.618 +0100 WARN CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucket already in flight 04-06-2021 08:07:39.618 +0100 ERROR CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucket already in flight 04-06-2021 08:07:39.618 +0100 INFO CMSlave - bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=PendingDiscard to=Complete for reason="failed to schedule delete bucket" 04-06-2021 08:07:39.618 +0100 ERROR ClusterSlaveBucketHandler - truncate bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 earliest=0 latest=0 err='bucket already in flight' 04-06-2021 08:07:39.618 +0100 INFO CMSlave - truncate request bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 current bid status=Complete 04-06-2021 08:07:39.619 +0100 INFO CMSlave - bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=Complete to=PendingDiscard for reason="schedule delete bucket" 04-06-2021 08:07:39.619 +0100 WARN CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucket already in flight 04-06-2021 08:07:39.619 +0100 ERROR CMSlave - event=scheduleDeleteBucket, bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucket already in flight 04-06-2021 08:07:39.619 +0100 INFO CMSlave - bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 Transitioning status from=PendingDiscard to=Complete for reason="failed to schedule delete bucket" 04-06-2021 08:07:39.619 +0100 ERROR ClusterSlaveBucketHandler - truncate bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bytes=0x0 earliest=0 latest=0 err='bucket already in flight' 04-06-2021 08:07:39.620 +0100 INFO CMSlave - Received resync bucket request for bid=bel1_qa_apps~19028~25359C10-2544-436D-893A-657C950D7863 bucketExists=1 04-06-2021 08:07:39.620 +0100 INFO CMSlave - Received resync bucket request for bid=bel1_qa_apps~19090~25359C10-2544-436D-893A-657C950D7863 bucketExists=1 Because of this, the Generation ID is also increasing quite rapidly. The status for all the buckets in question is stuck on 'PendingDiscard'. The same messages are appearing on the second node but with different bucket IDs. The same ID's keep repeating every few seconds on both peers. Should I restart each peer one at a time in hope that the bucket status is released and the fix up jobs can run as normal? Do I need to restart the cluster master? Any advice is appreciated. Thank you

Posts	11
Solutions	0
Karma Given	3
Karma Received	0
Member Since	‎04-06-2021

Online Status	Offline
Date Last Visited	‎01-10-2022 02:30 PM

Why is my Index Current Size greater than my Max S...

How Many Indexes Per Peer (Overall size)

Restart order/procedure after updating server.conf...

Cluster Index Bucket Stuck as "In Flight" - Roll, ...

Re: Why is my Index Current Size greater than my M...

Why is my Index Current Size greater than my Max S...

How Many Indexes Per Peer (Overall size)

Restart order/procedure after updating server.conf...

Re: Cluster Index Bucket Stuck as "In Flight" - Ro...

Re: Cluster Index Bucket Stuck as "In Flight" - Ro...

Re: Cluster Index Bucket Stuck as "In Flight" - Ro...

Re: Cluster Index Bucket Stuck as "In Flight" - Ro...

Re: Cluster Index Bucket Stuck as "In Flight" - Ro...

Re: Cluster Index Bucket Stuck as "In Flight" - Ro...

Cluster Index Bucket Stuck as "In Flight" - Roll, ...

Are you a member of the Splunk Community?