Hello fellow Splunkers,
I am trying to get to the bottom of issue I am having on an index cluster, specifically around hot buckets rolling before they should. I have read all of the docs around this and the expected behavior is not lining up with what I would expect. I am not investigating this because of retention purposes, I have made adjustments to the maxWarmDBCount, as well as homePath.maxDataSizeMB settings to ensure account for that. The issue I am seeing is more around having over 20,000 buckets for some indexes that don’t get much data. By now I am guessing that many of you are wondering why is this a problem. The problem I am trying to solve is long rolling-restart times on the index cluster. Because of the time the indexer spend registering bid’s with the cluster master, it can take upwards of 2 hours for a single indexer to rejoin the cluster. This makes routine maintenance a bit of a struggle as the restart takes 10-12 hours to complete. In total there are 14 indexes that I am having this problem with, where the bucket count is abnormally high for the volume of data being ingested which is dramatically increasing the time it takes for the peers to become searchable in the cluster following a restart. For the purposes of this post, I am going to use the windows index.
Background that might be useful:
Index Cluster Specs
OS: RHEL 6.9 64-bit
CPU: 32 cores
RAM: 64GB
Local Disk: 12TB usable RAID 1+0 (20K IOPS) - used for homePath
Remote Disk: 40TB - used for coldPath (1200 IOPS)
Network: 10GB
12 Node Index Cluster (2 sites, SF=2, RF=3)
Windows data:
Average: 19GB/day
Peak: 25GB/day
Index Cluster Maintenance:
Tuesdays & Thursdays - only as required for adding indexes/props/transforms (I know this causes the bucket to roll so we do this sparingly)
indexes.conf config:
[windows]
maxDataSize = auto
maxHotBuckets = 3
maxWarmDBCount = 1200
homePath.maxDataSizeMB = 178000
homePath = volume:primary/windows/db
coldPath = volume:coldstorage/windows/colddb
thawedPath = $SPLUNK_DB/windows/thaweddb
repFactor = auto
btool output:
[windows]
assureUTF8 = false
bucketRebuildMemoryHint = auto
coldPath = volume:coldstorage/windows/colddb
coldPath.maxDataSizeMB = 0
coldToFrozenDir =
coldToFrozenScript =
compressRawdata = true
defaultDatabase = main
enableDataIntegrityControl = false
enableOnlineBucketRepair = true
enableRealtimeSearch = true
enableTsidxReduction = false
frozenTimePeriodInSecs = 31536000
homePath = volume:primary/windows/db
homePath.maxDataSizeMB = 178000
hotBucketTimeRefreshInterval = 10
indexThreads = auto
journalCompression = gzip
maxBloomBackfillBucketAge = 30d
maxBucketSizeCacheEntries = 0
maxConcurrentOptimizes = 6
maxDataSize = auto
maxHotBuckets = 3
maxHotIdleSecs = 0
maxHotSpanSecs = 7776000
maxMemMB = 5
maxMetaEntries = 1000000
maxRunningProcessGroups = 8
maxRunningProcessGroupsLowPriority = 1
maxTimeUnreplicatedNoAcks = 300
maxTimeUnreplicatedWithAcks = 60
maxTotalDataSizeMB = 10000000
maxWarmDBCount = 1200
memPoolMB = auto
minRawFileSyncSecs = disable
minStreamGroupQueueSize = 2000
partialServiceMetaPeriod = 0
processTrackerServiceInterval = 1
quarantineFutureSecs = 2592000
quarantinePastSecs = 77760000
rawChunkSizeBytes = 131072
repFactor = auto
rotatePeriodInSecs = 60
serviceMetaPeriod = 25
serviceOnlyAsNeeded = true
serviceSubtaskTimingPeriod = 30
streamingTargetTsidxSyncPeriodMsec = 5000
suppressBannerList =
sync = 0
syncMeta = true
thawedPath = $SPLUNK_DB/windows/thaweddb
throttleCheckPeriod = 15
timePeriodInSecBeforeTsidxReduction = 604800
tsidxReductionCheckPeriodInSec = 600
tstatsHomePath = volume:_splunk_summaries/$_index_name/datamodel_summary
warmToColdScript =
Based on these configurations, with maxDataSize = auto, I would expect hot buckets to get to around 750MB before rolling - since the maxHotSpanSecs = 7776000 which is 90 days. What I am actually seeing is that the buckets are rolling before they reach 100MB in most cases, some manage to make it to 400MB but that is the largest I see them get before they roll. This has resulted in a bucket count for this index of over 10,000 buckets. Each indexer has between 3,000 and 3,500 buckets. During the restart, roughly 3 buckets are registered per second - meaning just registering the windows index buckets with the cluster master takes almost 20 minutes. I have researched the _internal index looking for causality of the hot buckets rolling prematurely and I do not see anything useful:
07-27-2017 04:38:17.008 +0000 INFO IndexWriter - idx=windows Creating hot bucket=hot_v1_1974, given event timestamped=1501126613
07-27-2017 04:38:17.008 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 . Reason='Adding bucket, bid=windows~1974~8AEF590F-9B95-451A-B7FD-AF5233075C49'
07-27-2017 04:44:58.994 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 . Reason='Adding bucket, bid=windows~6~4156302F-A116-477D-B637-37532D2BA1B3'
07-27-2017 05:25:10.048 +0000 INFO ProcessTracker - (child_137__Fsck) Fsck - (bloomfilter only) Rebuild for bucket='/opt/splunk/var/lib/splunk/windows/db/db_1501126138_1501060973_1973_8AEF590F-9B95-451A-B7FD-AF5233075C49' took 977.2 milliseconds
07-27-2017 05:25:11.582 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=1 . Reason='Updating manifest: bucketUpdates=1'
07-27-2017 05:25:16.726 +0000 INFO ProcessTracker - (child_138__Fsck) Fsck - (bloomfilter only) Rebuild for bucket='/opt/splunk/var/lib/splunk/windows/db/db_1501060972_1500929755_1972_8AEF590F-9B95-451A-B7FD-AF5233075C49' took 5.586 seconds
07-27-2017 05:25:17.582 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=1 . Reason='Updating manifest: bucketUpdates=1'
07-27-2017 05:25:29.983 +0000 INFO ProcessTracker - (child_139__Fsck) Fsck - (bloomfilter only) Rebuild for bucket='/opt/splunk/var/lib/splunk/windows/db/db_1500549595_1495828554_1968_8AEF590F-9B95-451A-B7FD-AF5233075C49' took 12.92 seconds
07-27-2017 05:25:30.581 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=1 . Reason='Updating manifest: bucketUpdates=1'
07-27-2017 05:25:46.098 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 . Reason='Buckets were rebuilt or tsidx-minified (bucket_count=3).'
07-27-2017 10:59:18.818 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 . Reason='Updating bucket, bid=windows~1974~8AEF590F-9B95-451A-B7FD-AF5233075C49'
07-27-2017 10:59:18.879 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 . Reason='Updating bucket, bid=windows~1974~8AEF590F-9B95-451A-B7FD-AF5233075C49'
07-27-2017 10:59:19.662 +0000 INFO IndexWriter - idx=windows Creating hot bucket=hot_v1_1975, given event timestamped=1501153150
07-27-2017 10:59:19.663 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 . Reason='Adding bucket, bid=windows~1975~8AEF590F-9B95-451A-B7FD-AF5233075C49'
07-27-2017 10:59:53.266 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 . Reason='Adding bucket, bid=windows~965~67EA25F7-D999-4D31-8A46-7EF1B0B281AC'
07-27-2017 10:59:54.694 +0000 INFO BucketMover - idx=windows Moving bucket='rb_1497113527_1497042556_1586_78CBDAF4-4C0B-4493-8361-047DCAFF4113' because maximum number of warm databases exceeded, starting warm_to_cold: from='/opt/splunk/var/lib/splunk/windows/db' to='/mnt/splunk/cold/windows/colddb'
07-27-2017 10:59:54.694 +0000 INFO BucketMover - idx=windows bucket=rb_1497113527_1497042556_1586_78CBDAF4-4C0B-4493-8361-047DCAFF4113 Firing async chiller: from='/opt/splunk/var/lib/splunk/windows/db' to='/mnt/splunk/cold/windows/colddb'
07-27-2017 10:59:58.504 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 . Reason='Adding bucket, bid=windows~2432~2B7D2EBA-D465-457F-ACB2-DF6CBAE09797'
07-27-2017 11:00:13.745 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 . Reason='Updating bucket, bid=windows~1586~78CBDAF4-4C0B-4493-8361-047DCAFF4113'
07-27-2017 11:01:45.592 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 . Reason='Adding bucket, bid=windows~1306~48BA7BF9-6D4B-47B9-A971-572507094A78'
07-27-2017 11:02:34.123 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 . Reason='Updating bucket, bid=windows~965~67EA25F7-D999-4D31-8A46-7EF1B0B281AC'
07-27-2017 11:02:35.407 +0000 INFO BucketMover - idx=windows Moving bucket='rb_1497127317_1497060191_1154_F1E6A997-949C-48AA-A3E4-48762822B93A' because maximum number of warm databases exceeded, starting warm_to_cold: from='/opt/splunk/var/lib/splunk/windows/db' to='/mnt/splunk/cold/windows/colddb'
07-27-2017 11:02:35.407 +0000 INFO BucketMover - idx=windows bucket=rb_1497127317_1497060191_1154_F1E6A997-949C-48AA-A3E4-48762822B93A Firing async chiller: from='/opt/splunk/var/lib/splunk/windows/db' to='/mnt/splunk/cold/windows/colddb'
07-27-2017 11:02:35.857 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 . Reason='Buckets were rebuilt or tsidx-minified (bucket_count=2).'
07-27-2017 11:02:35.964 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 . Reason='Updating bucket, bid=windows~1975~8AEF590F-9B95-451A-B7FD-AF5233075C49'
07-27-2017 11:02:35.997 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 . Reason='Updating bucket, bid=windows~1975~8AEF590F-9B95-451A-B7FD-AF5233075C49'
07-27-2017 11:02:44.529 +0000 INFO DatabaseDirectoryManager - idx=windows Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/windows/db', pendingBucketUpdates=0 . Reason='Updating bucket, bid=windows~2432~2B7D2EBA-D465-457F-ACB2-DF6CBAE09797'
Does anyone out there have any additional suggestions for how to troubleshoot this issue?
... View more