What does this message mean regarding the health status of Splunkd?

Explorer

Hello splunkers,

I need your help. I have an alert about a bucket on my Splunk.

This is the message that I have:

"The percentage of small of buckets created (50) over the last hour is very high and exceeded the yellow thresholds (30) for index=_internal, and possibly more indexes, on this indexer"

What does it mean, and how can I fix it?

Re: What does this message mean regarding the health status of Splunkd?

Motivator

This is most likely related to an issue with the event times: either your timestamp extraction is not working properly, your server times are way off, or your applications are logging the wrong time.

Try investigating this, as Splunk will create new buckets when the events coming into an index are outside of a certain time range. Enough of those prematurely rolled buckets will then trigger this warning.

Re: What does this message mean regarding the health status of Splunkd?

Explorer

Is there a way to fix that issue? It may be an index configuration issue.

Re: What does this message mean regarding the health status of Splunkd?

Motivator

You can narrow down the issue by checking the index latency (the gap between _indextime and _time) per index. Unusually large or negative values can indicate where event timestamps are off:

index=* index!=_* | eval latency=_indextime-_time | stats min(latency), max(latency), avg(latency), median(latency) by index 
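
For anyone who wants to see what that search computes, here is a minimal Python sketch of the same arithmetic. The index names and epoch values are made up for illustration; only the latency formula (_indextime minus _time) and the four statistics come from the search above.

```python
from statistics import mean, median

# Hypothetical (index, _time, _indextime) samples in epoch seconds; not real data.
events = [
    ("web", 1_000, 1_002),
    ("web", 1_010, 1_013),
    ("app", 2_000, 2_001),
    ("app", 2_000, 5_600),  # event timestamp is an hour older than its arrival
]

def latency_stats(rows):
    """Group latency (_indextime - _time) by index and summarize it."""
    by_index = {}
    for index, event_time, index_time in rows:
        by_index.setdefault(index, []).append(index_time - event_time)
    return {
        index: {
            "min": min(vals),
            "max": max(vals),
            "avg": mean(vals),
            "median": median(vals),
        }
        for index, vals in by_index.items()
    }

stats = latency_stats(events)
for index, summary in sorted(stats.items()):
    print(index, summary)
```

In this made-up data the "app" index shows a maximum latency of 3600 seconds while "web" stays at a few seconds, which is the kind of outlier the search is meant to surface.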

Re: What does this message mean regarding the health status of Splunkd?

Motivator

I've been going crazy with this error, so I did a write-up here which has a query to identify the indexes that are being flagged.

Query

index=_internal sourcetype=splunkd component=HotBucketRoller "finished moving hot to warm"
 | eval bucketSizeMB = round(size / 1024 / 1024, 2)
 | table _time splunk_server idx bid bucketSizeMB
 | rename idx as index
 | join type=left index 
     [ | rest /services/data/indexes count=0
       | rename title as index
       | eval maxDataSize = case (maxDataSize == "auto",             750,
                                  maxDataSize == "auto_high_volume", 10000,
                                  true(),                            maxDataSize)
       | table  index updated currentDBSizeMB homePath.maxDataSizeMB maxDataSize maxHotBuckets maxWarmDBCount ]
 | eval bucketSizePercent = round(100*(bucketSizeMB/maxDataSize))
 | eval isSmallBucket     = if (bucketSizePercent < 10, 1, 0)
 | stats sum(isSmallBucket) as num_small_buckets
         count              as num_total_buckets
         by index splunk_server
 | eval  percentSmallBuckets = round(100*(num_small_buckets/num_total_buckets))
 | sort  - percentSmallBuckets
 | eval isViolation = if (percentSmallBuckets > 30, "Yes", "No")
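
If the SPL is hard to follow, the classification it performs can be sketched in a few lines of Python. The 10% and 30% cutoffs and the 750 MB value for maxDataSize=auto are taken from the query above; the bucket sizes are hypothetical, and this mirrors the query only, not necessarily how Splunk's own health check counts small buckets.

```python
# Cutoffs copied from the search above: a bucket counts as "small" when it
# rolls at under 10% of maxDataSize, and more than 30% small buckets is flagged.
SMALL_BUCKET_PERCENT = 10
VIOLATION_PERCENT = 30

def percent_small_buckets(bucket_sizes_mb, max_data_size_mb):
    """Return the rounded percentage of buckets below the small-bucket cutoff."""
    small = sum(
        1 for size in bucket_sizes_mb
        if round(100 * size / max_data_size_mb) < SMALL_BUCKET_PERCENT
    )
    return round(100 * small / len(bucket_sizes_mb))

# Hypothetical rolled-bucket sizes in MB for one index with maxDataSize=auto (750 MB).
sizes_mb = [700.0, 12.5, 3.0, 640.0]
pct = percent_small_buckets(sizes_mb, 750)
is_violation = "Yes" if pct > VIOLATION_PERCENT else "No"
print(pct, is_violation)  # prints: 50 Yes
```

Two of the four hypothetical buckets rolled at a tiny fraction of the maximum size, so half of them count as small and the index would be flagged.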

From there, and stealing from @DMohn, plug your index into this query:

index=abc
| eval latency=_indextime-_time
| stats min(latency),
        max(latency),
        avg(latency),
        median(latency)
    by index sourcetype

Hopefully, one or more of the sourcetypes stands out to you. Add that sourcetype to the query to narrow things down by host (or, if the problem affects all hosts, you now know which sourcetype to investigate). In our case, a few heavy forwarders (e.g. search heads) do not have all of the necessary sourcetypes defined.

index=abc sourcetype=def
| eval latency=_indextime-_time
| stats min(latency),
        max(latency),
        avg(latency),
        median(latency)
    by index sourcetype host

Good luck!

Cheers,
Jacob

Re: What does this message mean regarding the health status of Splunkd?

Explorer

@jacobevans Apologies if I come across as dense, but I think I'm missing something. I'm very new to this.

I ran the first query to identify the indexes causing the alert. There were a few.
I selected one of the indexes and ran the 2nd search to identify the sourcetypes.

I plugged one of the sourcetypes into the final search to find the hosts.

The search returns 2 hosts but there are three hosts with that app deployed from the deployment server. But no logs from the 3rd server... which I guess means it could still be a timestamp extraction issue, right?

All of the logs with high latency are from one of two sources. But again, no logs from the 2nd source being monitored in that app.

I had a sysadmin check the time on one of the affected hosts and the time matched the current time.

I reran the 3rd search starting with Last 4 Hours and adding 4 hour increments until I got up to last 24 hours.

It looks like the high latency only occurred between 20 and 24 hours ago. 0 - 20 hours ago latency was super duper low.

Is it possible there's something happening on that host which could cause a delay in sending logs to Splunk?

Can Splunk indexers become overloaded with logs being received at the same time and take a while to index them all?

The highest latency as of right now is for an event with a _time of 22:00 02-09-2020 and an _indextime of 17:35 02-10-2020. _time and the timestamp in the raw log match.
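
To put a number on that gap, here is a quick Python check of the two timestamps quoted above. It assumes they are written in MM-DD-YYYY order and are in the same time zone; if either assumption is wrong, the result changes.

```python
from datetime import datetime

# Assumption: the quoted timestamps use MM-DD-YYYY order and share one time zone.
FORMAT = "%H:%M %m-%d-%Y"

event_time = datetime.strptime("22:00 02-09-2020", FORMAT)  # _time
index_time = datetime.strptime("17:35 02-10-2020", FORMAT)  # _indextime

latency = index_time - event_time
latency_seconds = int(latency.total_seconds())
print(latency_seconds, latency)  # prints: 70500 19:35:00
```

Under those assumptions the event was indexed 19 hours 35 minutes after it was logged, which is the value the latency eval in the searches above would report for it (70500 seconds).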

EDIT: I did some more searching on Google and here within Splunk Answers and remembered we have the monitoring console. I checked Indexing Performance: Instance and Indexing Performance: Advanced.

Within Advanced I found CPU usage and it never gets very much above 10%. Is there something else I should check to eliminate Splunk/the indexers as the cause before asking another team to investigate why their host is delaying the sending of logs?
