Hello
I had that red warning right before the username in Splunk, and after analyzing it I found a few sourcetypes with incorrect time parsing.
I have fixed all of these failures, but the red warning is still showing (it has been roughly one hour since the last parsing error).
If there are no more parsing errors, when will the red warning disappear?
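For anyone checking the same thing, here is a quick sketch of a search that should show whether new timestamp parsing errors are still arriving (DateParserVerbose is the splunkd component that logs them):
index=_internal sourcetype=splunkd component=DateParserVerbose
| timechart span=1h count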
Greetings @net1993,
Please post your version. After upgrading from 6.6.4 to 7.2.4, we are seeing the same error. According to @kheo_splunk in this Splunk Answers post, a small bucket is 10% of maxDataSize
for the index (although I couldn't find that documented in indexes.conf or health.conf). Here's as far as I've gotten with this:
On an indexer, click the health badge in the header bar next to your user name, then click Buckets. The panel shows:
Root Cause(s):
The percentage of small of buckets created (83) over the last hour is very high and exceeded the red thresholds (50) for index=windows, and possibly more indexes, on this indexer
Last 50 related messages:
08-16-2019 10:30:21.649 -0400 INFO HotBucketRoller - finished moving hot to warm bid=services~920~0514B976-C45E-486C-B57C-A1E810AEC966 idx=services from=hot_v1_920 to=db_1565890631_1565852558_920_0514B976-C45E-486C-B57C-A1E810AEC966 size=393109504 caller=lru maxHotBuckets=3, count=4 hot buckets,evicting_count=1 LRU hots
08-16-2019 10:00:03.781 -0400 INFO HotBucketRoller - finished moving hot to warm bid=windows~145~0514B976-C45E-486C-B57C-A1E810AEC966 idx=windows from=hot_v1_145 to=db_1565761563_1564808117_145_0514B976-C45E-486C-B57C-A1E810AEC966 size=1052672 caller=lru maxHotBuckets=3, count=4 hot buckets,evicting_count=1 LRU hots
We have two indexers. The two indexers have different numbers (83 on Indexer 1, 66 on Indexer 2) and errors, so it appears to be checking them separately. As a side note, I do not believe the "over the last hour" part of the error is accurate. The setting to change this is indicator:percent_small_buckets_created_last_24h, which leads me to believe the search is over the past 24 hours.
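If you want to adjust (or temporarily relax) that indicator while investigating, the thresholds can be overridden in health.conf on the indexer. A minimal sketch using the default values quoted above (verify the exact attribute names against health.conf.spec for your version):
[feature:buckets]
indicator:percent_small_buckets_created_last_24h:yellow = 30
indicator:percent_small_buckets_created_last_24h:red = 50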
Run the following search for either yesterday or the previous 24 hours. I haven't narrowed down the exact time frame, but it does seem to be some variation of 24 hours.
index=_internal sourcetype=splunkd component=HotBucketRoller "finished moving hot to warm"
| eval bucketSizeMB = round(size / 1024 / 1024, 2)
| table _time splunk_server idx bid bucketSizeMB
| rename idx as index
| join type=left index
[ | rest /services/data/indexes count=0
| rename title as index
| eval maxDataSize = case (maxDataSize == "auto", 750,
maxDataSize == "auto_high_volume", 10000,
true(), maxDataSize)
| table index updated currentDBSizeMB homePath.maxDataSizeMB maxDataSize maxHotBuckets maxWarmDBCount ]
| eval bucketSizePercent = round(100*(bucketSizeMB/maxDataSize))
| eval isSmallBucket = if (bucketSizePercent < 10, 1, 0)
| stats sum(isSmallBucket) as num_small_buckets
count as num_total_buckets
by index splunk_server
| eval percentSmallBuckets = round(100*(num_small_buckets/num_total_buckets))
| sort - percentSmallBuckets
| eval isViolation = if (percentSmallBuckets > 30, "Yes", "No")
Breaking it down,
index=_internal sourcetype=splunkd component=HotBucketRoller "finished moving hot to warm"
| eval bucketSizeMB = round(size / 1024 / 1024, 2)
| table _time splunk_server idx bid bucketSizeMB
| rename idx as index
Get each instance of a hot bucket rolling to a warm bucket. The idx field is renamed to "index" so the join works properly; the size field gives the size of the now-warm bucket in bytes.
| join type=left index
[ | rest /services/data/indexes count=0
| rename title as index
| eval maxDataSize = case (maxDataSize == "auto", 750,
maxDataSize == "auto_high_volume", 10000,
true(), maxDataSize)
| table index updated currentDBSizeMB homePath.maxDataSizeMB maxDataSize maxHotBuckets maxWarmDBCount ]
Join each rollover event to a REST call to get the maxDataSize for that index. A value of "auto" means 750 MB; "auto_high_volume" means 10 GB (or 1 GB on 32-bit systems). The rest is fairly self-explanatory, but I'll explain a few lines.
| eval bucketSizePercent = round(100*(bucketSizeMB/maxDataSize))
| eval isSmallBucket = if (bucketSizePercent < 10, 1, 0)
Apparently a small bucket is one smaller than 10% of the maxDataSize for the index. For example, with maxDataSize=auto (750 MB), any bucket under 75 MB counts as small.
| eval isViolation = if (percentSmallBuckets > 30, "Yes", "No")
The standard setting for a violation is >30%.
This still does not fully work for me, but I believe the answer is close to this. I tried past 60 minutes (definitely not), past 24 hours, Today, and Yesterday. None of the values match, although I do see "violations". One thing I did notice is that the numbers displayed in the error (83 and 66 for me) do not seem to change as if this check is not running often (every 4 hours? Once a day?).
If anyone sees anything wrong, just let me know.
Edit: fixed one issue. This query is now close enough to accurate for my purposes. It does work to find indexes with a high percent of small buckets, it just doesn't match the numbers that Splunk shows.
I encountered this problem as well.
The alert was triggered by restarting an indexer cluster peer, which caused that peer to roll the hot buckets of all its indexes. I believe this is just a one-time thing, and the internal logs show that HotBucketRoller is working perfectly normally.
My problem is that the alert has now been showing for almost a day.
@jacobevans Have you found how long the health status alert will stay?
I'm not sure. It seemed to be in the range of 24 hours, but I could never perfectly emulate it.
My error disappeared once I initiated a restart. It seems that after the restart, Splunk reran the check and the issue was gone.
That's good to know. I wonder if running the health check might also fix it without the need for a restart.
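For what it's worth, you can also poll the health report over REST rather than clicking the badge, which makes it easier to watch for the indicator clearing. A sketch, assuming the splunkd health endpoint that ships with the 7.2+ health report:
| rest /services/server/health/splunkd/details splunk_server=local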
Great answer, just helped me to find an issue! Thanks
cheers, MuS
Happy to share something that took much longer than it should've. Thanks for marking as answer MuS, cheers.
Hello @jacobevans
Thank you for the extensive comment.
I am, though, a bit unsure what you are trying to say. Do you mean this is a Splunk bug, or that the message is correct and there is a problem with some sourcetype?
I see this behaviour on Splunk 7.2.6, and I believe the errors started to appear after we upgraded 2-3 months ago from v6.6.3.
I've just tried the search, but I get no results for the last 24 hours, only for the last 7 days, so I don't understand why the message says "over the last hour".
Also, when I run the search over 7 days, all of the result rows have isViolation="No".
I added a bit more info to this comment here - it was built off of this one: https://answers.splunk.com/answers/725555/what-does-this-message-mean-regarding-the-health-s.html#an...
Long story short, I do not think this is a bug. I believe Splunk is making a fairly simple calculation, probably much better than my query, and that the health indicator simply wasn't part of anything in 6.x, which is why you never noticed it before. I have a few other queries I can throw your way if you're interested. When you say my query showed nothing, was it the first main query in the answer or the dbinspect?
Sure, please share them.
Yes, I used the very first query: 19 rows. I ran it for 7 days back, and the results that come up only have isViolation=No.
Here's a variation of dbinspect that sorts by the raw bucket size. Remember to put a large time range on this, since the time selector applies to the data in the buckets, not to the date the bucket was rolled over. Also, if you're using default settings, the standard max bucket size is 750 MB, so a small bucket is anything under 75 MB. In general, if you're running the original query I gave you, run it for yesterday or the last 24 hours to get semi-accurate results.
| dbinspect index=_* index=*
| search state!=hot
| convert ctime(startEpoch) as startTime
| convert ctime(endEpoch) as endTime
| search sizeOnDiskMB < 75
| sort sizeOnDiskMB
| eval sizeOnDiskMB=round(sizeOnDiskMB, 2)
| stats values(splunk_server) as splunk_servers
values(sizeOnDiskMB) as sizesOnDiskMB
values(modTime) as modTimes
by index id startTime endTime eventCount hostCount sourceTypeCount sourceCount state tsidxState
Another variation to get bucket size statistics:
| dbinspect index=_* index=*
| search state!=hot
| convert ctime(startEpoch) as startTime
| convert ctime(endEpoch) as endTime
| stats count
min(sizeOnDiskMB) as MinSizeOnDiskMB
avg(sizeOnDiskMB) as AvgSizeOnDiskMB
max(sizeOnDiskMB) as MaxSizeOnDiskMB
by index
| sort AvgSizeOnDiskMB
Once you find a small bucket, get info about it:
| dbinspect index=[index]
| search bucketId="[index]~197~C69837B0-EF8A-47EA-B75E-3640B0F6BB13"
| convert ctime(startEpoch) as startTime
| convert ctime(endEpoch) as endTime
| table bucketId id splunk_server index state modTime startTime endTime hostCount sourceTypeCount sourceCount eventCount sizeOnDiskMB path
Copy the bucketId into this query over the time period of startTime to endTime:
index=[index]
| eval cd = _cd, bkt = _bkt
| search bkt="[index]~197~C69837B0-EF8A-47EA-B75E-3640B0F6BB13"
| stats count by host sourcetype source bkt
If the cause isn't blatantly obvious by this point, try this query. A latency threshold of 1 hour worked for me, but you may need to adjust it:
index=[index]
| eval latency_hours = round(abs((_indextime-_time)/60/60), 2)
| search latency_hours > 1
| sort - latency_hours
Or, instead of the sort, use:
| stats count max(latency_hours) avg(latency_hours) by index splunk_server host sourcetype source
Thank you so much. I will test these searches and report back.
Also, here's a query to check your buckets without using the _internal index. Watch out for the time selector, though: it applies to the endEpoch field (endTime in my query), which is the maximum _time of the events in the bucket and has nothing to do with when the bucket rolled over, not to modTime as I would have expected. Because of that, give it a larger time span and then filter on modTime.
| dbinspect index=windows
| search state!=hot
| convert ctime(startEpoch) as startTime
| convert ctime(endEpoch) as endTime
| eval sizeOnDiskMB=round(sizeOnDiskMB, 2)
| stats values(splunk_server) as splunk_servers
values(sizeOnDiskMB) as sizesOnDiskMB
values(modTime) as modTimes
by index id startTime endTime eventCount hostCount sourceTypeCount sourceCount state tsidxState
If you're still stuck, I've added additional information on this answer: https://answers.splunk.com/answers/725555/what-does-this-message-mean-regarding-the-health-s.html?ch...
None of you know this?
I have also checked, and there is actually no high number of hot buckets, nor has there been in the past. There are only 3 hot buckets, which should be normal.
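In case it is useful, here is a quick sketch of how I counted hot buckets per index with dbinspect (the state field distinguishes hot from warm and cold buckets):
| dbinspect index=*
| search state=hot
| stats count as hot_buckets by index splunk_server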