What does this warning mean and how does it impact the performance?
The percentage of small of buckets created (40) over the last hour is very high and exceeded the yellow thresholds (30) for index=win, and possibly more indexes, on this indexer
What to do to remediate?
We are hitting this issue again as we upgrade to 7.3.3 to fix a timestamp issue:
The percentage of small of buckets created xx over the last hour is very high and exceeded the red thresholds (xx) for index=xxxxx, and possibly more indexes, on this indexer
I wonder if we can search for the sources/hosts whose time is not correct, so we can fix them. We have about 1200 forwarders.
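One way to find them is to search for events whose timestamps land in the future and group them by host and source. A sketch (narrow `index=*` to your own indexes; the relative time modifiers assume the search head's clock is correct):

```
index=* earliest=+1m latest=+10y
| stats count, min(_time) AS first_seen, max(_time) AS last_seen BY index, host, source
| convert ctime(first_seen) ctime(last_seen)
| sort - count
```

Hosts near the top of this list are the most likely candidates for a timezone offset problem.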
Yes, the issue is still resolved, and has been for months now. I haven't seen the warning again.
The main cause of this issue is most likely that the timestamps on the data you are feeding in are all over the place. Splunk wants data to arrive mostly in chronological order, so that each bucket contains data from a certain window of time.
For example, if your maximum hot bucket count is set to, say, 3, and each bucket holds data from a 6-hour window, then feeding in data that falls within an 18-hour window is fine, as Splunk can put each event into the appropriate 6-hour bucket. If you then give it an event with a random timestamp from outside that 18-hour window, there is no bucket to put it in.
This forces Splunk to close/roll one of the hot buckets early and create a new bucket for the "random" data. If you then resume sending data from within the 6-hour window of the hot bucket which was just force-rolled to warm, you're back to the same issue: Splunk doesn't have a bucket to put it in.
This repeats the "roll hot to warm; create new bucket" process on another of the buckets, causing it to roll early as well. Warm buckets cannot be rolled back to hot; Splunk only creates new ones. So if you keep feeding it data with timestamps all over the place, outside the window of time it has buckets for, buckets will constantly be rolled early and new, overlapping ones created, which is what causes this issue.
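You can actually watch these early rolls happening in the internal logs. A sketch (the HotBucketRoller component and the idx/caller key-value pairs are what splunkd.log emits on my version; verify field names on yours):

```
index=_internal sourcetype=splunkd component=HotBucketRoller "finished moving hot to warm"
| stats count BY idx, caller
| sort - count
```

A high count with a caller indicating the hot bucket limit was hit, on an index you didn't restart, is a strong hint that out-of-order timestamps are forcing the rolls.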
As I said above, the two things which fixed it for me were: searching for data "from the future", which identified data sources sending in GMT/UTC rather than GMT-5/-6/-7 (where my sources actually are), and then fixing the timezone of the timestamps at the source (or making sure props.conf contains parser definitions for your specific data sources in order to mangle them to the correct timezone); and increasing the maximum number of hot buckets Splunk is allowed to juggle at once, in my case from 3 to 5.
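For the timezone part, a minimal props.conf sketch (the sourcetype name is a placeholder; apply it on the indexer or heavy forwarder that does the parsing):

```
# props.conf -- "my:wineventlog" is a hypothetical sourcetype; use your own.
[my:wineventlog]
# Force the timezone for sources that send local time without an offset.
TZ = US/Central
```

TZ only applies when the raw event carries no explicit offset of its own, which is exactly the situation that produces the GMT-vs-local skew described above.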
Just increasing the max buckets may "fix" the warning, but you will still potentially end up with incorrectly timestamped records all over the place, and your queries won't find or return them, since they will sit in buckets that don't get searched because they are outside the time frame your query is looking at.
The above health warning is shown when the number of hot buckets created to index the data in a particular hour/day reaches or crosses the threshold defined in system/health.conf.
Reason behind the creation of too many hot buckets:
Splunk uses buckets as index directories to store the data. When an event arrives for indexing, either a new hot bucket is created for it, or it is indexed into one of the existing hot buckets, depending on the event's timestamp and the constraints on the index (refer to the maxHotBuckets, maxHotSpanSecs and quarantinePastSecs parameters of indexes.conf).
So, if an event has a timestamp older or stranger than what the existing buckets can accept, Splunk will create a new hot bucket to index it. Hence the number of hot buckets created may reach or cross the threshold.
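The relevant indexes.conf settings look like this (the index name and values are illustrative; check indexes.conf.spec for the defaults in your version):

```
# indexes.conf -- per-index bucket constraints (illustrative values)
[win]
# Maximum number of hot buckets that can exist simultaneously.
maxHotBuckets = 3
# Maximum timespan (in seconds) a single hot bucket may cover.
maxHotSpanSecs = 7776000
# Events older than now minus this many seconds go to a quarantine bucket.
quarantinePastSecs = 77760000
```

Together these three settings determine whether an incoming timestamp fits an existing hot bucket or forces a new one to be created.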
When the DATETIME_CONFIG parameter of props.conf is not set explicitly, Splunk tries to find the timestamp in the event itself. While extracting the timestamp, if Splunk finds a value that matches a valid timestamp format, it will parse that value as the event time and index the event using it; if the wrong value is picked up, the badly extracted timestamp can force Splunk to create a new hot bucket.
Consider adding the timestamp attributes in props.conf to make Splunk extract the timestamp properly.
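A sketch of the usual timestamp attributes (the sourcetype and values are placeholders for a hypothetical log format; adjust them to your data):

```
# props.conf -- hypothetical sourcetype with explicit timestamp extraction
[my:custom:log]
# Timestamp sits at the start of the raw event.
TIME_PREFIX = ^
# e.g. "2019-11-05 14:02:33 -0600"
TIME_FORMAT = %Y-%m-%d %H:%M:%S %z
# Don't scan past the timestamp looking for other date-like strings.
MAX_TIMESTAMP_LOOKAHEAD = 25
# Fallback timezone if an event carries no offset.
TZ = UTC
```

Pinning TIME_PREFIX and MAX_TIMESTAMP_LOOKAHEAD prevents Splunk from latching onto a stray date elsewhere in the event body.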
Link: Configure timestamp recognition
Even using your app and looking in the log files, I couldn't find any evidence that it was actually creating 60% new buckets each hour as it claimed to be. I modified the /opt/splunk/etc/system/local/indexes.conf file and added the following to raise the internal index's hot bucket limit from 3 to 5:
[_internal]
homePath = $SPLUNK_DB/_internaldb/db
coldPath = $SPLUNK_DB/_internaldb/colddb
thawedPath = $SPLUNK_DB/_internaldb/thaweddb
maxHotBuckets = 5
After a restart it's now showing green and has stayed that way for about 20 hours now.
Issues still resolved simonq?
I have this very issue in my indexer warnings... I haven't dealt with it yet as I don't have a good enough understanding of the real issue. It sounds like it's confusing to a lot of people too. Is increasing maxHotBuckets on the index listed in the error the correct way to address this?
I can't change the time setting of the incoming logs, and the props.conf is fine.
As per simonq's comment, the most common cause is a large variance in the timespan of the data coming in; this often relates to incorrect parsing of the data.
In Alerts for Splunk Admins I have a dashboard for issues by sourcetype and alerts around this, github link here. The monitoring console in modern versions also has a "Data Quality" tab which would help you here.
If the variation in timestamps is actually required for some reason, you could increase the number of hot buckets in indexes.conf (maxHotBuckets); however, you would likely be better served by fixing any timestamp parsing issues (maxHotBuckets does not normally need to be adjusted).
Try the dashboards:
Issues Per Sourcetype (for information about sourcetype issues, you will need to know the sourcetypes in the index in question)
or start with "Rolled Buckets By Index"
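Independent of the dashboards, the indexer's own parsing warnings can point at the offending hosts. A sketch (the component names are from splunkd.log; confirm they exist in your version):

```
index=_internal sourcetype=splunkd log_level=WARN
    (component=DateParserVerbose OR component=AggregatorMiningProcessor)
| stats count BY component, host
| sort - count
```

DateParserVerbose warnings in particular indicate events where Splunk failed to parse a timestamp, or found one outside the acceptable time window.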
Same issue here, started last night. Would be nice to know how to troubleshoot further. The affected index isn't particularly high volume, nor have I observed any other unusual activity lately.
See the other thread which is pretty much a duplicate of this one, here: https://answers.splunk.com/answers/701550/health-status-the-percentage-of-small-of-buckets-c.html#an...
Long story short is that the issue is caused by data which is arriving out of chronological order, caused by timezone parsing or sending issues, which needs to be fixed by modifying your timezone parsers, and/or reconfiguring the sending systems to include TZ offset, or the props.conf to specify the TZ of each source.
I am having the exact same issue. I have opened a case with Splunk Support, but all they did was copy and paste the response kheo provided.
In our environment we have not restarted Splunk for months, so this is not the cause of it prematurely rolling hot buckets.
The alert is triggered when the percentage of small buckets (by definition, less than 10% of maxDataSize for the index) created over the last 24 hours exceeds the current threshold (30).
Please check the relevant configuration in health.conf, as below:
display_name = Buckets
indicator:buckets_created_last_60m:yellow = 40
indicator:percent_small_buckets_created_last_24h:description = This indicator tracks the percentage of small buckets created over the last 24 hours. A small bucket is defined as less than 10% of the 'maxDataSize' setting in indexes.conf.
indicator:percent_small_buckets_created_last_24h:red = 50
If you'd like to disable or suppress the message, please check the following Splunk doc. http://docs.splunk.com/Documentation/Splunk/7.2.1/DMC/Configurefeaturemonitoring
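If you do decide to suppress it, a health.conf sketch (the stanza and setting names below are taken from health.conf.spec as I understand it; verify them against the linked doc before use):

```
# $SPLUNK_HOME/etc/system/local/health.conf
[feature:buckets]
# Disable health reporting for this feature entirely...
disabled = 1
# ...or instead leave it enabled and raise the indicator thresholds:
indicator:percent_small_buckets_created_last_24h:yellow = 60
indicator:percent_small_buckets_created_last_24h:red = 80
```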
As hot buckets roll to warm whenever Splunk is restarted, warm buckets can be created with a smaller size than the specified max size ('maxDataSize') for the index.
If this is the case in your environment, Splunk does not have control over this behaviour, and having smaller buckets will not cause any performance issue.
If smaller buckets come from other reasons, you may need to investigate the reason.
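To investigate, you can list bucket sizes directly with the dbinspect command. A sketch (replace the index name with yours):

```
| dbinspect index=win
| where state="warm"
| eval sizeMB = round(sizeOnDiskMB, 1)
| table bucketId, state, sizeMB, startEpoch, endEpoch
| sort sizeMB
```

Comparing the smallest buckets' startEpoch/endEpoch spans against your restart times shows whether the small buckets line up with restarts (harmless) or appear continuously (timestamp problem).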