Solved: The percentage of small of buckets is very high an...

net1993 · ‎08-08-2019

Hello,
I had that red warning next to the username in Splunk, and after analyzing it I found that a few sourcetypes had incorrect time parsing.
I have fixed all of these failures, but the red warning still appears (it has been approximately 1 hour since the last parsing error).
If there are no more parsing errors, when will the red warning disappear?

jacobpevans · ‎08-16-2019

Greetings @net1993,

Please post your version. After upgrading to 7.2.4 from 6.6.4, we are seeing the same error. According to @kheo_splunk in this Splunk Answers post, a small bucket is one smaller than 10% of maxDataSize for the index (although I couldn't find that in indexes.conf or health.conf). Here's as far as I've gotten with this:

Error

On an indexer, click the health badge in the header bar next to your user name, then Buckets.

Buckets
Root Cause(s):
The percentage of small of buckets created (83) over the last hour is very high and exceeded the red thresholds (50) for index=windows, and possibly more indexes, on this indexer
Last 50 related messages:
08-16-2019 10:30:21.649 -0400 INFO HotBucketRoller - finished moving hot to warm bid=services~920~0514B976-C45E-486C-B57C-A1E810AEC966 idx=services from=hot_v1_920 to=db_1565890631_1565852558_920_0514B976-C45E-486C-B57C-A1E810AEC966 size=393109504 caller=lru maxHotBuckets=3, count=4 hot buckets,evicting_count=1 LRU hots
08-16-2019 10:00:03.781 -0400 INFO HotBucketRoller - finished moving hot to warm bid=windows~145~0514B976-C45E-486C-B57C-A1E810AEC966 idx=windows from=hot_v1_145 to=db_1565761563_1564808117_145_0514B976-C45E-486C-B57C-A1E810AEC966 size=1052672 caller=lru maxHotBuckets=3, count=4 hot buckets,evicting_count=1 LRU hots

We have two indexers. They show different numbers (83 on Indexer 1, 66 on Indexer 2) and different errors, so the check appears to run on each indexer separately. As a side note, I do not believe the "over the last hour" part of the error is accurate. The setting to change this is indicator:percent_small_buckets_created_last_24h, which leads me to believe the search is over the past 24 hours.
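To make the two sample messages above easier to compare, here is a minimal Python sketch that pulls the index name and bucket size out of a "finished moving hot to warm" message and converts the size to MB, the same way the search below does. The regular expression and the function name parse_roll_event are my own illustration, not anything Splunk provides, and the sample lines are shortened copies of the ones quoted above.

```python
import re

# Shortened copies of the two HotBucketRoller messages quoted above
# (the from=/to= fields are dropped for readability).
SAMPLE_LINES = [
    "08-16-2019 10:30:21.649 -0400 INFO HotBucketRoller - finished moving hot to warm "
    "bid=services~920~0514B976-C45E-486C-B57C-A1E810AEC966 idx=services "
    "size=393109504 caller=lru maxHotBuckets=3",
    "08-16-2019 10:00:03.781 -0400 INFO HotBucketRoller - finished moving hot to warm "
    "bid=windows~145~0514B976-C45E-486C-B57C-A1E810AEC966 idx=windows "
    "size=1052672 caller=lru maxHotBuckets=3",
]

# Hypothetical pattern: assumes idx= appears before size= in the message,
# as it does in the samples.
ROLL_PATTERN = re.compile(r"\bidx=(?P<idx>\S+).*?\bsize=(?P<size>\d+)")


def parse_roll_event(line):
    """Return (index_name, size_in_mb), or None if the line does not match."""
    match = ROLL_PATTERN.search(line)
    if match is None:
        return None
    # size is reported in bytes; convert to MB like the eval in the search
    size_mb = round(int(match.group("size")) / 1024 / 1024, 2)
    return match.group("idx"), size_mb


for line in SAMPLE_LINES:
    print(parse_roll_event(line))
```

The windows bucket comes out at roughly 1 MB and the services bucket at roughly 375 MB, which is why only the windows index is named in the error.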

Queries

Run the following search for either yesterday or the previous 24 hours. I haven't narrowed down the exact time frame, but it does seem to be some variation of 24 hours.

index=_internal sourcetype=splunkd component=HotBucketRoller "finished moving hot to warm"
| eval bucketSizeMB = round(size / 1024 / 1024, 2)
| table _time splunk_server idx bid bucketSizeMB
| rename idx as index
| join type=left index 
    [ | rest /services/data/indexes count=0
      | rename title as index
      | eval maxDataSize = case (maxDataSize == "auto",             750,
                                 maxDataSize == "auto_high_volume", 10000,
                                 true(),                            maxDataSize)
      | table  index updated currentDBSizeMB homePath.maxDataSizeMB maxDataSize maxHotBuckets maxWarmDBCount ]
| eval bucketSizePercent = round(100*(bucketSizeMB/maxDataSize))
| eval isSmallBucket     = if (bucketSizePercent < 10, 1, 0)
| stats sum(isSmallBucket) as num_small_buckets
        count              as num_total_buckets
        by index splunk_server
| eval  percentSmallBuckets = round(100*(num_small_buckets/num_total_buckets))
| sort  - percentSmallBuckets
| eval isViolation = if (percentSmallBuckets > 30, "Yes", "No")

Breaking it down,

index=_internal sourcetype=splunkd component=HotBucketRoller "finished moving hot to warm"
| eval bucketSizeMB = round(size / 1024 / 1024, 2)
| table _time splunk_server idx bid bucketSizeMB
| rename idx as index

Get each instance of a hot bucket rolling over to a warm bucket, along with the size of the now-warm bucket. Rename idx to "index" so the join works properly.

| join type=left index 
    [ | rest /services/data/indexes count=0
      | rename title as index
      | eval maxDataSize = case (maxDataSize == "auto",             750,
                                 maxDataSize == "auto_high_volume", 10000,
                                 true(),                            maxDataSize)
      | table  index updated currentDBSizeMB homePath.maxDataSizeMB maxDataSize maxHotBuckets maxWarmDBCount ]

Join each rollover event to a REST call to get the maxDataSize for that index. A value of "auto" is 750MB. "auto_high_volume" is 10GB (or 1GB on 32-bit systems). The rest is pretty self-explanatory, but I'll explain a few lines.
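The case() expression in the subsearch can be read as the following small Python sketch. The function name resolve_max_data_size_mb and the is_64_bit flag are hypothetical; the 1000 MB value for 32-bit systems is my assumption, following the same convention the query uses when it writes 10GB as 10000.

```python
def resolve_max_data_size_mb(raw_value, is_64_bit=True):
    """Translate a maxDataSize setting into a size in MB.

    "auto" and "auto_high_volume" are keywords; anything else is
    expected to be a number of MB given as a string or a number.
    """
    if raw_value == "auto":
        return 750.0
    if raw_value == "auto_high_volume":
        # 10GB on 64-bit, 1GB on 32-bit; 1000 is an assumed MB figure
        return 10000.0 if is_64_bit else 1000.0
    return float(raw_value)


for setting in ("auto", "auto_high_volume", "2000"):
    print(setting, "->", resolve_max_data_size_mb(setting))
```

Note that the query above does not handle the 32-bit case; it always uses 10000 for "auto_high_volume".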

| eval bucketSizePercent = round(100*(bucketSizeMB/maxDataSize))
| eval isSmallBucket     = if (bucketSizePercent < 10, 1, 0)

Apparently a small bucket is <10% of the maxDataSize for the index.

| eval isViolation = if (percentSmallBuckets > 30, "Yes", "No")

The standard setting for a violation is >30%. Note that the error message quoted above reports 50 as the red threshold.
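Putting the last few steps together, here is a rough Python sketch of the same arithmetic: mark each rolled bucket as small if it is under 10% of maxDataSize, then compute the percentage of small buckets per index and indexer and compare it to 30. This mirrors my query, not whatever Splunk runs internally. The function name, the sample rolls, and the server name idx1 are made up for illustration.

```python
from collections import defaultdict

SMALL_BUCKET_PERCENT = 10   # a bucket under 10% of maxDataSize counts as small
VIOLATION_PERCENT = 30      # threshold used by isViolation in the query


def summarize_small_buckets(rolls, max_data_size_mb):
    """rolls: iterable of (index, splunk_server, bucket_size_mb).
    max_data_size_mb: dict of index -> maxDataSize in MB.
    Returns rows of (index, server, percent_small, is_violation),
    sorted by percent_small, highest first."""
    counts = defaultdict(lambda: [0, 0])  # (index, server) -> [small, total]
    for index, server, size_mb in rolls:
        bucket_percent = round(100 * size_mb / max_data_size_mb[index])
        entry = counts[(index, server)]
        if bucket_percent < SMALL_BUCKET_PERCENT:
            entry[0] += 1
        entry[1] += 1
    rows = []
    for (index, server), (small, total) in counts.items():
        percent_small = round(100 * small / total)
        rows.append((index, server, percent_small, percent_small > VIOLATION_PERCENT))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up rollover events for two indexes on one indexer
sample_rolls = [
    ("windows", "idx1", 1.0),
    ("windows", "idx1", 2.5),
    ("windows", "idx1", 400.0),
    ("services", "idx1", 374.9),
    ("services", "idx1", 700.0),
]
for row in summarize_small_buckets(sample_rolls, {"windows": 750, "services": 750}):
    print(row)
```

With this sample data, two of the three windows buckets are small, so windows is flagged, while services is not.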

This still does not fully work for me, but I believe the answer is close to this. I tried past 60 minutes (definitely not), past 24 hours, Today, and Yesterday. None of the values match, although I do see "violations". One thing I did notice is that the numbers displayed in the error (83 and 66 for me) do not seem to change, as if this check is not running often (every 4 hours? Once a day?).

If anyone sees anything wrong, just let me know.

Edit: fixed one issue. This query is now close enough to accurate for my purposes. It does work to find indexes with a high percentage of small buckets; it just doesn't match the numbers that Splunk shows.

Cheers,
Jacob

If you feel this response answered your question, please do not forget to mark it as such. If it did not, but you do have the answer, feel free to answer your own post and accept that as the answer.

natalielam · ‎05-28-2020

I encountered this problem as well.
The alert was triggered by restarting an indexer cluster peer, which caused this peer to roll all its indexes. I believe this is just a one-time thing, and the internal logs show that the HotBucketRoller is working normally.
My problem is that the alert has stayed for almost a day now.
@jacobpevans Have you found out how long the health status alert will stay?

jacobpevans · ‎05-28-2020

I'm not sure. It seemed to be in the range of 24 hours, but I could never perfectly emulate it.

Cheers,
Jacob

natalielam · ‎05-28-2020

My error disappeared once I initiated a restart. It seems that after the restart, Splunk reruns the search and the issue is gone.

jacobpevans · ‎05-31-2020

That's good to know. I wonder if running the health check might also fix it without the need for a restart.

Cheers,
Jacob

MuS · ‎02-18-2020

Great answer, just helped me to find an issue! Thanks

cheers, MuS

jacobpevans · ‎03-05-2020

Happy to share something that took much longer than it should've. Thanks for marking as answer MuS, cheers.

Cheers,
Jacob

net1993 · ‎08-26-2019

Hello @jacobpevans,
Thank you for the extensive comment.
I am a bit unsure what you are trying to say, though. Do you mean this is a Splunk bug, or that this is correct and there is a problem with some sourcetype?
I see this behaviour on Splunk 7.2.6, and I believe the errors started to appear after we upgraded from v. 6.6.3 2-3 months ago.
I've just tried the search, but I get no results for 24 hours, only for 7 days, so I don't understand why the message says "over the last hour".
Also, when I run the search for 7 days, all of the result rows have isViolation="No".

jacobpevans · ‎08-26-2019

I added a bit more info to this comment here - it was built off of this one: https://answers.splunk.com/answers/725555/what-does-this-message-mean-regarding-the-health-s.html#an...

Long story short, I do not think this is a bug. I believe Splunk is making a fairly simple calculation, which is probably much better than my query, and that the health indicator was simply not part of anything in 6.x, which is why you never noticed it before. I have a few other queries I can throw your way if you're interested. When you say my query showed nothing, was it the first main query in the answer or the dbinspect?

Cheers,
Jacob

net1993 · ‎08-26-2019

Sure, please send them my way.
Yes, I used the very first query: 19 rows. I ran it for 7 days back, and the results that come up only have isViolation=No.

jacobpevans · ‎08-27-2019

Variation of dbinspect to sort by the raw bucket size. Remember, put a large time scale on this since the time selector applies to the data in the buckets - not the date the bucket was rolled over. Also, if you're using default settings, then the standard max bucket size is 750MB so a small bucket size is 75MB. In general, if you're running the original query I gave you, you want to run it for yesterday or the last 24 hours to get semi-accurate results.

| dbinspect index=_* index=*
| search state!=hot
| convert ctime(startEpoch) as startTime
| convert ctime(endEpoch)   as endTime
| search sizeOnDiskMB < 75
| sort sizeOnDiskMB
| eval sizeOnDiskMB=round(sizeOnDiskMB, 2)
| stats values(splunk_server) as splunk_servers
        values(sizeOnDiskMB)  as sizesOnDiskMB
        values(modTime)       as modTimes
    by index id startTime endTime eventCount hostCount sourceTypeCount sourceCount state tsidxState

Another variation to get bucket size statistics

| dbinspect index=_* index=*
| search state!=hot
| convert ctime(startEpoch) as startTime
| convert ctime(endEpoch)   as endTime
| stats count
      min(sizeOnDiskMB) as MinSizeOnDiskMB
      avg(sizeOnDiskMB) as AvgSizeOnDiskMB
      max(sizeOnDiskMB) as MaxSizeOnDiskMB
  by index
| sort AvgSizeOnDiskMB

Once you find a small bucket, get info about it:

| dbinspect index=[index]
| search bucketId="[index]~197~C69837B0-EF8A-47EA-B75E-3640B0F6BB13"
| convert ctime(startEpoch) as startTime 
| convert ctime(endEpoch)   as endTime
| table bucketId id splunk_server index state modTime startTime endTime hostCount sourceTypeCount sourceCount eventCount sizeOnDiskMB path

Copy the bucketId into this query over the time period of startTime to endTime:

index=[index]
| eval cd = _cd, bkt = _bkt
| search bkt="[index]~197~C69837B0-EF8A-47EA-B75E-3640B0F6BB13"
| stats count by host sourcetype source bkt

If it's not blatantly obvious by this point, try this query. A latency threshold of 1 hour worked for me, but you may need to adjust it.

index=[index]
| eval   latency_hours = round(abs((_indextime-_time)/60/60), 2)
| search latency_hours > 1
| sort - latency_hours

Or, instead of sort,

| stats count max(latency_hours) avg(latency_hours) by index splunk_server host sourcetype source
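For anyone unsure what latency_hours measures in the query above, it is just the absolute gap between the time an event was indexed (_indextime) and the timestamp Splunk assigned to it (_time), in hours. A minimal Python sketch of the same arithmetic follows; the function name and the sample epoch values are made up for illustration.

```python
def latency_hours(index_time, event_time):
    """Absolute gap between index time and event time, in hours.

    Both arguments are epoch seconds, like _indextime and _time.
    """
    return round(abs(index_time - event_time) / 60 / 60, 2)


# Made-up (host, _indextime, _time) samples
events = [
    ("host-a", 1565890631, 1565890600),   # indexed about 30 seconds after the event
    ("host-b", 1565890631, 1565852558),   # timestamp is hours away from index time
]

# Keep only events with more than 1 hour of latency, largest first
late = sorted(
    ((host, latency_hours(indexed, stamped)) for host, indexed, stamped in events),
    key=lambda item: item[1],
    reverse=True,
)
for host, hours in late:
    if hours > 1:
        print(host, hours)
```

Because the gap is taken as an absolute value, timestamps parsed into the future show up the same way as timestamps parsed into the past.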

Cheers,
Jacob

net1993 · ‎08-27-2019

Thank you so much. I will test these searches and report back.

jacobpevans · ‎08-16-2019

Also, here's a query to check your buckets without using the _internal log. Watch out for the time selector though: it applies to the endEpoch field (endTime in my query), which is the maximum _time of the events in the bucket and has nothing to do with when the bucket rolled over. It does not apply to modTime, as I would have expected. Thus, you'd need to give a larger time span and then filter based on modTime.

| dbinspect index=windows
| search state!=hot
| convert ctime(startEpoch) as startTime
| convert ctime(endEpoch)   as endTime
| eval sizeOnDiskMB=round(sizeOnDiskMB, 2)
| stats values(splunk_server) as splunk_servers
        values(sizeOnDiskMB)  as sizesOnDiskMB
        values(modTime)       as modTimes
    by index id startTime endTime eventCount hostCount sourceTypeCount sourceCount state tsidxState

If you're still stuck, I've added additional information on this answer: https://answers.splunk.com/answers/725555/what-does-this-message-mean-regarding-the-health-s.html?ch...

Cheers,
Jacob

net1993 · ‎08-09-2019

Does nobody know this?

net1993 · ‎08-09-2019

I have also checked, and there is no high number of hot buckets, nor has there been in the past. There are only 3 hot buckets, which should be normal.
