Deployment Architecture

Help with hot buckets rolling prematurely

ShaneNewman
Motivator

Hello fellow Splunkers,

I am trying to get to the bottom of an issue I am having on an index cluster, specifically hot buckets rolling before they should. I have read all of the docs on this and the documented behavior is not lining up with what I am seeing. I am not investigating this for retention purposes; I have already adjusted the maxWarmDBCount and homePath.maxDataSizeMB settings to account for that. The issue is that some indexes that don't get much data have over 20,000 buckets.

By now many of you are probably wondering why this is a problem. The problem I am trying to solve is long rolling-restart times on the index cluster. Because of the time the indexers spend registering bids (bucket IDs) with the cluster master, it can take upwards of 2 hours for a single indexer to rejoin the cluster. This makes routine maintenance a struggle, as a rolling restart takes 10-12 hours to complete. In total there are 14 indexes with this problem, where the bucket count is abnormally high for the volume of data being ingested, which dramatically increases the time it takes for the peers to become searchable in the cluster after a restart. For the purposes of this post, I am going to use the windows index.
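
For reference, a dbinspect search along these lines (standard dbinspect output fields, run from a search head that can see all the peers) will show the per-peer bucket counts I am referring to:

| dbinspect index=windows
| stats count AS buckets, sum(sizeOnDiskMB) AS size_mb BY splunk_server, state
| sort splunk_server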

Background that might be useful:

Index Cluster Specs
OS: RHEL 6.9 64-bit
CPU: 32 cores
RAM: 64GB
Local Disk: 12TB usable RAID 1+0 (20K IOPS) - used for homePath
Remote Disk: 40TB - used for coldPath (1200 IOPS)
Network: 10Gb

12 Node Index Cluster (2 sites, SF=2, RF=3)

Windows data:
Average: 19GB/day
Peak: 25GB/day

Index Cluster Maintenance:
Tuesdays & Thursdays - only as required for adding indexes/props/transforms (I know this causes hot buckets to roll, so we do this sparingly)

indexes.conf config:

[windows]
maxDataSize = auto
maxHotBuckets = 3
maxWarmDBCount = 1200
homePath.maxDataSizeMB = 178000
homePath   = volume:primary/windows/db
coldPath   = volume:coldstorage/windows/colddb
thawedPath = $SPLUNK_DB/windows/thaweddb
repFactor = auto

btool output:

[windows]
assureUTF8 = false
bucketRebuildMemoryHint = auto
coldPath = volume:coldstorage/windows/colddb
coldPath.maxDataSizeMB = 0
coldToFrozenDir = 
coldToFrozenScript = 
compressRawdata = true
defaultDatabase = main
enableDataIntegrityControl = false
enableOnlineBucketRepair = true
enableRealtimeSearch = true
enableTsidxReduction = false
frozenTimePeriodInSecs = 31536000
homePath = volume:primary/windows/db
homePath.maxDataSizeMB = 178000
hotBucketTimeRefreshInterval = 10
indexThreads = auto
journalCompression = gzip
maxBloomBackfillBucketAge = 30d
maxBucketSizeCacheEntries = 0
maxConcurrentOptimizes = 6
maxDataSize = auto
maxHotBuckets = 3
maxHotIdleSecs = 0
maxHotSpanSecs = 7776000
maxMemMB = 5
maxMetaEntries = 1000000
maxRunningProcessGroups = 8
maxRunningProcessGroupsLowPriority = 1
maxTimeUnreplicatedNoAcks = 300
maxTimeUnreplicatedWithAcks = 60
maxTotalDataSizeMB = 10000000
maxWarmDBCount = 1200
memPoolMB = auto
minRawFileSyncSecs = disable
minStreamGroupQueueSize = 2000
partialServiceMetaPeriod = 0
processTrackerServiceInterval = 1
quarantineFutureSecs = 2592000
quarantinePastSecs = 77760000
rawChunkSizeBytes = 131072
repFactor = auto
rotatePeriodInSecs = 60
serviceMetaPeriod = 25
serviceOnlyAsNeeded = true
serviceSubtaskTimingPeriod = 30
streamingTargetTsidxSyncPeriodMsec = 5000
suppressBannerList = 
sync = 0
syncMeta = true
thawedPath = $SPLUNK_DB/windows/thaweddb
throttleCheckPeriod = 15
timePeriodInSecBeforeTsidxReduction = 604800
tsidxReductionCheckPeriodInSec = 600
tstatsHomePath = volume:_splunk_summaries/$_index_name/datamodel_summary
warmToColdScript = 

Based on these configurations, with maxDataSize = auto, I would expect hot buckets to reach around 750MB before rolling, and since maxHotSpanSecs = 7776000 (90 days), a time-based roll should not happen any sooner at this data volume. What I am actually seeing is that the buckets are rolling before they reach 100MB in most cases; some manage to make it to 400MB, but that is the largest I see them get before they roll. This has resulted in a bucket count for this index of over 10,000 buckets, with each indexer holding between 3,000 and 3,500. During a restart, roughly 3 buckets are registered per second, meaning just registering the windows index buckets with the cluster master takes almost 20 minutes per indexer. I have searched the _internal index looking for the cause of the hot buckets rolling prematurely and I do not see anything useful:

07-27-2017 04:38:17.008 +0000 INFO  IndexWriter - idx=windows Creating hot bucket=hot_v1_1974, given event timestamped=1501126613
07-27-2017 04:38:17.008 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 .  Reason='Adding bucket, bid=windows~1974~8AEF590F-9B95-451A-B7FD-AF5233075C49'
07-27-2017 04:44:58.994 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 .  Reason='Adding bucket, bid=windows~6~4156302F-A116-477D-B637-37532D2BA1B3'
07-27-2017 05:25:10.048 +0000 INFO  ProcessTracker - (child_137__Fsck)  Fsck - (bloomfilter only) Rebuild for bucket='/opt/splunk/var/lib/splunk/windows/db/db_1501126138_1501060973_1973_8AEF590F-9B95-451A-B7FD-AF5233075C49' took 977.2 milliseconds
07-27-2017 05:25:11.582 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=1 .  Reason='Updating manifest: bucketUpdates=1'
07-27-2017 05:25:16.726 +0000 INFO  ProcessTracker - (child_138__Fsck)  Fsck - (bloomfilter only) Rebuild for bucket='/opt/splunk/var/lib/splunk/windows/db/db_1501060972_1500929755_1972_8AEF590F-9B95-451A-B7FD-AF5233075C49' took 5.586 seconds
07-27-2017 05:25:17.582 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=1 .  Reason='Updating manifest: bucketUpdates=1'
07-27-2017 05:25:29.983 +0000 INFO  ProcessTracker - (child_139__Fsck)  Fsck - (bloomfilter only) Rebuild for bucket='/opt/splunk/var/lib/splunk/windows/db/db_1500549595_1495828554_1968_8AEF590F-9B95-451A-B7FD-AF5233075C49' took 12.92 seconds
07-27-2017 05:25:30.581 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=1 .  Reason='Updating manifest: bucketUpdates=1'
07-27-2017 05:25:46.098 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 .  Reason='Buckets were rebuilt or tsidx-minified (bucket_count=3).'
07-27-2017 10:59:18.818 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 .  Reason='Updating bucket, bid=windows~1974~8AEF590F-9B95-451A-B7FD-AF5233075C49'
07-27-2017 10:59:18.879 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 .  Reason='Updating bucket, bid=windows~1974~8AEF590F-9B95-451A-B7FD-AF5233075C49'
07-27-2017 10:59:19.662 +0000 INFO  IndexWriter - idx=windows Creating hot bucket=hot_v1_1975, given event timestamped=1501153150
07-27-2017 10:59:19.663 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 .  Reason='Adding bucket, bid=windows~1975~8AEF590F-9B95-451A-B7FD-AF5233075C49'
07-27-2017 10:59:53.266 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 .  Reason='Adding bucket, bid=windows~965~67EA25F7-D999-4D31-8A46-7EF1B0B281AC'
07-27-2017 10:59:54.694 +0000 INFO  BucketMover - idx=windows Moving bucket='rb_1497113527_1497042556_1586_78CBDAF4-4C0B-4493-8361-047DCAFF4113' because maximum number of warm databases exceeded, starting warm_to_cold: from='/opt/splunk/var/lib/splunk/windows/db' to='/mnt/splunk/cold/windows/colddb'
07-27-2017 10:59:54.694 +0000 INFO  BucketMover - idx=windows bucket=rb_1497113527_1497042556_1586_78CBDAF4-4C0B-4493-8361-047DCAFF4113 Firing async chiller: from='/opt/splunk/var/lib/splunk/windows/db' to='/mnt/splunk/cold/windows/colddb'
07-27-2017 10:59:58.504 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 .  Reason='Adding bucket, bid=windows~2432~2B7D2EBA-D465-457F-ACB2-DF6CBAE09797'
07-27-2017 11:00:13.745 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 .  Reason='Updating bucket, bid=windows~1586~78CBDAF4-4C0B-4493-8361-047DCAFF4113'
07-27-2017 11:01:45.592 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 .  Reason='Adding bucket, bid=windows~1306~48BA7BF9-6D4B-47B9-A971-572507094A78'
07-27-2017 11:02:34.123 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 .  Reason='Updating bucket, bid=windows~965~67EA25F7-D999-4D31-8A46-7EF1B0B281AC'
07-27-2017 11:02:35.407 +0000 INFO  BucketMover - idx=windows Moving bucket='rb_1497127317_1497060191_1154_F1E6A997-949C-48AA-A3E4-48762822B93A' because maximum number of warm databases exceeded, starting warm_to_cold: from='/opt/splunk/var/lib/splunk/windows/db' to='/mnt/splunk/cold/windows/colddb'
07-27-2017 11:02:35.407 +0000 INFO  BucketMover - idx=windows bucket=rb_1497127317_1497060191_1154_F1E6A997-949C-48AA-A3E4-48762822B93A Firing async chiller: from='/opt/splunk/var/lib/splunk/windows/db' to='/mnt/splunk/cold/windows/colddb'
07-27-2017 11:02:35.857 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 .  Reason='Buckets were rebuilt or tsidx-minified (bucket_count=2).'
07-27-2017 11:02:35.964 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 .  Reason='Updating bucket, bid=windows~1975~8AEF590F-9B95-451A-B7FD-AF5233075C49'
07-27-2017 11:02:35.997 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 .  Reason='Updating bucket, bid=windows~1975~8AEF590F-9B95-451A-B7FD-AF5233075C49'
07-27-2017 11:02:44.529 +0000 INFO  DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 .  Reason='Updating bucket, bid=windows~2432~2B7D2EBA-D465-457F-ACB2-DF6CBAE09797'
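
To quantify the premature rolls, a dbinspect search roughly like this (grouping the warm buckets by the size dbinspect reports in MB) shows how far short of 750MB they are landing:

| dbinspect index=windows
| search state=warm
| eval size_range=case(sizeOnDiskMB<100, "under 100MB", sizeOnDiskMB<400, "100-400MB", sizeOnDiskMB<750, "400-750MB", true(), "750MB or more")
| stats count BY size_range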

Does anyone out there have any additional suggestions for how to troubleshoot this issue?

0 Karma
1 Solution

sowings
Splunk Employee

This is support case territory; your CM sounds overloaded, and you may also have network connectivity issues. The most common reason I see for a "premature roll" in an indexing cluster, when it's not about too many hot buckets, is "I lost contact with the replication target, eek!" I'd pursue that avenue.
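
A rough starting point for checking that on the peers (splunkd.log component names, adjust as needed for your version):

index=_internal sourcetype=splunkd (log_level=WARN OR log_level=ERROR)
    (component=BucketReplicator OR component=CMSlave)
| stats count BY host, component, log_level
| sort - count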

Shephali
Explorer

@ShaneNewman: Hi, did you ever reach a conclusion on a possible reason for this? I'm facing a similar situation on my end.
Thanks!

ShaneNewman
Motivator

Sandy,

I figured you would chime in on this at some point. I have been using Fire Brigade to tune the index settings over the past few months and it has been very helpful. I had considered that particular scenario, though. We have had some network connectivity issues over the last few months and the CM was a bit overloaded; we have since resolved those issues and the problem has lessened, but it is still happening.

I am going to turn the log levels on these categories up to DEBUG to see if I can glean any additional insight (a sketch of the change follows the list):

category.BucketMover
category.BucketReplicator
category.DatabaseDirectoryManager
category.HotBucketRoller
category.HotDBManager
category.IndexerIf
category.IndexWriter
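
A sketch of the change, assuming the usual log-local.cfg override of log.cfg in $SPLUNK_HOME/etc (the same categories can also be raised from the server logging page in the UI):

[splunkd]
category.BucketMover=DEBUG
category.BucketReplicator=DEBUG
category.DatabaseDirectoryManager=DEBUG
category.HotBucketRoller=DEBUG
category.HotDBManager=DEBUG
category.IndexerIf=DEBUG
category.IndexWriter=DEBUG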

Are there any others you would suggest turning up? If I cannot find anything after doing this I will submit a case and involve the "Andrews".

0 Karma

ShaneNewman
Motivator

Sanford,

Are you aware of any tuning that I can do on the CM/indexers to reduce the amount of time it takes buckets to register, aside from the obvious step of reducing the number of buckets?

0 Karma

mattymo
Splunk Employee

Hi ShaneNewman,

How are your volumes looking utilization-wise?

Any hints in the earliest and latest times in | dbinspect for these buckets? Maybe some bad timestamps causing havoc?
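
Something along these lines, using the standard epoch fields dbinspect returns, would surface buckets whose event time span looks suspicious:

| dbinspect index=windows
| eval span_days=round((endEpoch - startEpoch) / 86400, 1)
| table bucketId, splunk_server, state, span_days, sizeOnDiskMB
| sort - span_days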

The Management Console provides some pretty nice views under Indexing > Indexes and Volumes / Index Detail. Any hints there?

- MattyMo
0 Karma

ShaneNewman
Motivator

Hey mmodestino,

Utilization on the homePath for these indexes is very low, less than 50% in some cases. I update the index configs every month based on the prior month's trends to get homePath utilization as close to 100% as I can. I have been using the DMC and sowings' FireBrigade app to research this issue. Nothing has been overly helpful in finding why this is happening, just things pointing out the fact that it is happening.
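
For what it's worth, a check along these lines (summing hot/warm sizes from dbinspect against the 178000MB homePath cap configured above) gives that picture per peer:

| dbinspect index=windows
| search state=hot OR state=warm
| stats sum(sizeOnDiskMB) AS homepath_mb BY splunk_server
| eval pct_of_cap=round(homepath_mb / 178000 * 100, 1)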

0 Karma

mattymo
Splunk Employee

yeah, sowings knows a thing er two about a thing er two 😉

Sounds like support should have a look... what Splunk version?

- MattyMo
0 Karma

ShaneNewman
Motivator

Version 6.4.4

0 Karma

kellewic
Path Finder

I don't see any logs for "HotBucketRoller"; look for those specifically in "index=_internal sourcetype=splunkd". That will give the reason a hot bucket rolls to warm.
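
For example, something like this should summarize the roll reasons, since caller, idx, and size are key=value pairs in those events:

index=_internal sourcetype=splunkd component=HotBucketRoller "finished moving hot to warm"
| stats count, avg(size) AS avg_size_bytes BY idx, caller
| sort - count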

0 Karma

ShaneNewman
Motivator

These are the events from the indexer:

07-27-2017 11:11:19.749 +0000 INFO  HotBucketRoller - finished moving hot to warm bid=windows~1977~8AEF590F-9B95-451A-B7FD-AF5233075C49 idx=windows from=hot_v1_1977 to=db_1501153769_1501153151_1977_8AEF590F-9B95-451A-B7FD-AF5233075C49 size=118784 caller=marked isinit=false selective=true
07-27-2017 11:05:55.625 +0000 INFO  HotBucketRoller - finished moving hot to warm bid=windows~1976~8AEF590F-9B95-451A-B7FD-AF5233075C49 idx=windows from=hot_v1_1976 to=db_1501153439_1501153151_1976_8AEF590F-9B95-451A-B7FD-AF5233075C49 size=53248 caller=clusterslavecontrol_rollhotbucket isinit=false selective=false
07-27-2017 11:02:36.030 +0000 INFO  HotBucketRoller - finished moving hot to warm bid=windows~1975~8AEF590F-9B95-451A-B7FD-AF5233075C49 idx=windows from=hot_v1_1975 to=db_1501153241_1501153150_1975_8AEF590F-9B95-451A-B7FD-AF5233075C49 size=65536 caller=marked isinit=false selective=true
07-27-2017 10:59:18.911 +0000 INFO  HotBucketRoller - finished moving hot to warm bid=windows~1974~8AEF590F-9B95-451A-B7FD-AF5233075C49 idx=windows from=hot_v1_1974 to=db_1501153150_1497734091_1974_8AEF590F-9B95-451A-B7FD-AF5233075C49 size=81895424 caller=cmslave_rolling_enabled isinit=false selective=false

From the cluster Master:

07-27-2017 11:33:34.585 +0000 INFO  CMBucket - set bucket summary bid=windows~1977~8AEF590F-9B95-451A-B7FD-AF5233075C49 summaryId=4C4DE6D8-2AEA-4335-92A3-74850587B037_Splunk_CiscoSecuritySuite_nobody_NSa08beba97eb758ed peer=67EA25F7-D999-4D31-8A46-7EF1B0B281AC type=report_acceleration state=done modtime=1501155208.000000 checksum=AB5C4BFA8BBEAB04F369013C675A499EA9AD9626A13E3BDC216DA8A70DE345FB

Is there something I should be looking for in particular on the cluster master?

0 Karma

sowings
Splunk Employee

component=CMPeer transitioning

0 Karma

sowings
Splunk Employee

Smells like an overloaded CM with too many buckets (750MB, really?!) possibly overrunning its service_interval and thereby plugging its ears when an addPeer comes in. This could cause disruptions in replication too, imo.

0 Karma

ShaneNewman
Motivator

The CM doesn't appear to be overloaded from a resource perspective. CPU is averaging 80% idle over the last week with at least 4GB of RAM free. Load is around 1.2.

What's wrong with 750MB hot buckets? We're getting 20GB per day / 12 indexers ≈ 1.7GB per indexer, and 3 x 750MB hot buckets is ~2.25GB... seemed pretty close to me.

0 Karma

kellewic
Path Finder

I've never seen those callers. The only reasons I can think of for a hot to warm roll are size, time, idle, max hot buckets hit; or someone forced it through a restart, GUI, or REST endpoint.

For HotBucketRoller messages, all I ever see are "caller=lru", "caller=size_exceeded", and "caller=idle".

Are you sure the times in the events aren't spread out all over the place?

0 Karma

ShaneNewman
Motivator

I just verified, for the hot buckets that recently rolled, that the events are pretty well grouped; some were a week or two old, but for the most part they are only days old. I do see some messages about hitting the max number of hot buckets and a new one being added. I cannot find any reason for the buckets rolling other than that.

0 Karma