I have been getting the following type of message for the _internal and other indexes: "The percentage of small buckets created (75) over the last hour is very high and exceeded the red thresholds (50) for index=_internal, and possibly more indexes, on this indexer."
What could cause this, and how do I go about troubleshooting to determine the cause? I have not been able to find anything in the logs yet.
I have the same issue with Splunk 7.2.3 in my test environment.
I have just installed Splunk and set up the whole architecture: 1 deployment server, 1 cluster master, 2 indexers, 1 search head, and 1 application server.
All of my servers are in the same timezone and their clocks are exactly in sync.
Also, I have not even started forwarding any data to my indexers.
So I can rule out both possibilities: incoming data from data sources as well as timezone issues.
Do you have any other suggestions that could help solve this issue?
You probably already did this, but if not, I would grep your splunkd.log for words like DateParserVerbose, WARN, ERROR, etc. You could also see if you can find exactly which index has too many buckets. If it is one of your own or one of the built-in Splunk indexes, that could help narrow your search for the issue.
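A minimal sketch of that grep, assuming a default Splunk install path. The sample log file below is fabricated so the commands run self-contained; on a real indexer you would point at `$SPLUNK_HOME/var/log/splunk/splunkd.log` instead.

```shell
# Fabricated sample so this runs anywhere; on a real indexer, grep
# $SPLUNK_HOME/var/log/splunk/splunkd.log (typically /opt/splunk/var/log/splunk/splunkd.log).
cat > /tmp/splunkd_sample.log <<'EOF'
12-19-2018 10:08:38.875 -0500 WARN  DateParserVerbose - Accepted time is suspiciously far away from the previous event's time
12-19-2018 10:08:39.001 -0500 INFO  Metrics - group=per_index_thruput, series="_internal"
12-19-2018 10:08:40.120 -0500 ERROR TcpOutputProc - Connection to indexer closed
EOF

# Surface timestamp-parsing complaints and general problems in one pass:
grep -E 'DateParserVerbose|WARN|ERROR' /tmp/splunkd_sample.log
```

The single `grep -E` with alternation avoids scanning the (often large) log three times.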
I too opened a case with Splunk about this, and we identified two causes: problems with timezone parsing that made Splunk think events were months in the future, and source systems in different timezones, some sending records with timestamps in local time (e.g. EST, CST, MST) and others sending UTC, so those appeared to be 5 hours in the future.
The months-in-the-future issue was fixed by modifying the timestamp parsers for those records; I am still working out how to solve the timezone offset issue.
Yes, as my reply above says, we resolved this issue. It was mainly fixed by: running a search for events "from the future" (e.g. "*" with a date/time range between roughly half an hour in the future and a week in the future) to identify the data sources that were misconfigured and/or had timestamps being parsed incorrectly (e.g. UTC/GMT being parsed as 5 hours in the future, since we are in EST/UTC-5); fixing all of those to return correct, sane values; increasing the hot bucket limit for the indexer DB, which had been set to 3 (I think I set it to 5); and finally restarting Splunk.
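A sketch of the two pieces described above. The search uses relative time modifiers to look only at events stamped in the future; the index name in the conf stanza is hypothetical, and `maxHotBuckets` defaults vary by Splunk version, so verify against your indexes.conf.spec before applying.

```
index=* earliest=+30m latest=+7d
| stats count by index, sourcetype, source, host
```

```
# indexes.conf on the indexer (stanza name "my_index" is hypothetical)
[my_index]
# Allow more concurrent hot buckets so out-of-order timestamps
# don't force constant small-bucket rolls; 5 was the value used above.
maxHotBuckets = 5
```

Any source/sourcetype/host combination returned by the search is a candidate for a timestamp or timezone fix.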
It's been a couple of weeks now and the warning has not returned.
In case anyone else finds this, I've done a pretty big write-up on two answers. Start here: https://answers.splunk.com/answers/725555/what-does-this-message-mean-regarding-the-health-s.html?ch...
Not sure exactly what you are asking, but Splunk support closed the case right after they provided the DateParserVerbose answer. As far as I can tell, that is likely the problem. I still have some bad data sources (databases, etc.) whose output throws the error and creates too many buckets every once in a while.
I have been dealing with Splunk support on this issue. They think it may be the fact that I have some data sources whose events are way off in time from other events with the same source (because some devices were incorrectly set to a different time zone than all the others). When the time is that far off, apparently Splunk does not know how to deal with it and puts the event in a separate bucket. Do you see messages like this in your splunkd.log?
12-19-2018 10:08:38.875 -0500 WARN DateParserVerbose - Accepted time (Wed Dec 19 10:08:35 2018) is suspiciously far away from the previous event's time (Wed Dec 19 15:08:35 2018), but still accepted because it was extracted by the same pattern. Context: source=/log/switch/switch.log|host=XXXXX|switch|401549
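When the skew comes from devices stamping events in the wrong local time, Splunk lets you pin a timezone per host or sourcetype in props.conf on the parsing tier (indexer or heavy forwarder). A hedged sketch; the host and sourcetype names are hypothetical, and `MAX_DAYS_HENCE` defaults should be checked against your version's props.conf.spec:

```
# props.conf on the indexer/heavy forwarder (names below are examples)
[host::misconfigured-switch]
# Interpret this host's timestamps as US Eastern rather than trusting the event
TZ = US/Eastern

[my_utc_sourcetype]
# For sources that genuinely send UTC
TZ = UTC
# Reject timestamps more than 2 days in the future instead of indexing them
MAX_DAYS_HENCE = 2
```

Fixing the timezone at parse time keeps future-stamped events from forcing new hot buckets in the first place.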
You know, you actually bring up a great point. I will need to recheck the NTP settings on my Splunk Enterprise indexers just to make sure. Thanks, this is very useful!
There wasn't a setting in indexes.conf that you could tweak to change the hot bucket rollover threshold from 50 to 75?
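For what it's worth, those 50/75 numbers appear to come from the splunkd health report rather than from bucket rollover itself, so in recent versions the thresholds would live in health.conf, not indexes.conf. A sketch only; the indicator names below are my best guess and must be verified against `$SPLUNK_HOME/etc/system/default/health.conf` for your version:

```
# health.conf (indicator names are assumptions; check your version's defaults)
[feature:buckets]
indicator:percent_small_buckets_created_last_24h:yellow = 50
indicator:percent_small_buckets_created_last_24h:red = 75
```

Raising the threshold silences the warning but does not fix the underlying small-bucket creation.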