It appears that splunk-optimize is not able to keep up with the number of tsidx files being created. This particular scenario involves about 10-20 GB of data a day. It appears that at least once every 24 hours, all indexing pauses, the queues block, and the indexer logs the following message:
idx=main Throttling indexer, too many tsidx files in bucket='/opt/splunk/var/lib/splunk/defaultdb/db/<hot_bucket>'. Is splunk-optimize working? If not, low disk space may be the cause.
This is causing queue saturation and blockage as evidenced in SoS graphs and blocked=true messages. It appears that there are about 3 hot buckets being written to at the time of occurrence. Overall indexing throughput stays about the same.
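For anyone who wants to quantify the blocked=true messages rather than eyeball them, here is a minimal Python sketch. It assumes the queue entries in metrics.log are comma-separated key=value pairs that include group=queue and a name field; the sample lines and queue names below are made up for illustration, not copied from a real log.

```python
import re
from collections import Counter

# Matches key=value pairs such as name=indexqueue or blocked=true.
PAIR_RE = re.compile(r'(\w+)=("[^"]*"|[^,\s]+)')


def parse_pairs(line):
    """Return the key=value pairs found in one log line as a dict."""
    return {key: value.strip('"') for key, value in PAIR_RE.findall(line)}


def count_blocked_queues(lines):
    """Count lines carrying group=queue and blocked=true, keyed by queue name."""
    counts = Counter()
    for line in lines:
        fields = parse_pairs(line)
        if fields.get("group") == "queue" and fields.get("blocked") == "true":
            counts[fields.get("name", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; a real run would iterate over metrics.log.
    sample = [
        "INFO Metrics - group=queue, name=indexqueue, blocked=true, current_size_kb=499",
        "INFO Metrics - group=queue, name=indexqueue, current_size_kb=10",
        "INFO Metrics - group=queue, name=typingqueue, blocked=true, current_size_kb=499",
    ]
    for queue, hits in count_blocked_queues(sample).most_common():
        print(queue, hits)
```

Running this over the hours around an incident should show which queue blocks first, which is usually more telling than the total count.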
Further investigation reveals that when this occurs, there are usually 100+ tsidx files in one of the hot buckets. We raised maxRunningProcessGroups from 8 to 20 (the default setting pre-5.0) for the indexes in indexes.conf. This appears to help somewhat, but eventually we run into the same issue again. This started after upgrading to 5.0.
I would like to know the best way to track the performance of splunk-optimize. To my understanding, this process should be tracked by ProcessTracker, but I'm not seeing anything in there besides fsck activities. Shouldn't splunk-optimize be tracked there as well?
I did notice that in metrics.log, under group=subtaskseconds, there is a throttle_optimize=<value> field. This value is usually very low, but in some cases it spikes to around 80 seconds. What exactly is this measuring? Am I correct in assuming it measures some type of optimization throttling going on?
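To see when those spikes happen, a small Python sketch like the one below can pull the throttle_optimize values out of the group=subtaskseconds lines. It does not answer what the field measures; it only assumes the value is a plain decimal number written as throttle_optimize=<value>. The function names and the 60-second threshold are illustrative choices, not anything Splunk defines.

```python
import re

# Assumes the value is a plain decimal such as 0.001 or 80.2.
VALUE_RE = re.compile(r'\bthrottle_optimize=([0-9]*\.?[0-9]+)')


def throttle_optimize_values(lines):
    """Return throttle_optimize values from group=subtaskseconds lines, in order."""
    values = []
    for line in lines:
        if "group=subtaskseconds" not in line:
            continue  # other metric groups are ignored
        match = VALUE_RE.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


def spikes(values, threshold):
    """Return the values at or above the chosen threshold."""
    return [value for value in values if value >= threshold]


if __name__ == "__main__":
    # Hypothetical sample lines; a real run would read metrics.log instead.
    sample = [
        "INFO Metrics - group=subtaskseconds, name=indexer, throttle_optimize=0.001",
        "INFO Metrics - group=subtaskseconds, name=indexer, throttle_optimize=80.2",
    ]
    print(spikes(throttle_optimize_values(sample), threshold=60.0))
```

Lining the spike timestamps up against the "Throttling indexer" warnings would show whether the two are related.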
We are experiencing the same pattern you just described. We reported this issue to Splunk 2 weeks ago when we upgraded to Splunk 5.0. We increased maxRunningProcessGroups like you explained without luck.
We upgraded to Splunk 5.0.1 and it seems that the issue is not solved yet. Since we have redundant indexers, we monitor the number of tsidx files per hot bucket and watch for blocked queues. We have been forced to restart splunkd at least every 36 hours, and to check and repair buckets, to avoid corruption of the indexed data.
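For reference, counting tsidx files per hot bucket can be sketched in a few lines of Python. This is an illustration, not the script we actually run. It assumes hot bucket directories sit directly under the index's db directory with names starting with hot_, and that tsidx files carry a .tsidx extension; the limit of 100 is taken from the counts reported above.

```python
import os


def count_tsidx_per_hot_bucket(db_path, hot_prefix="hot_"):
    """Map each hot bucket directory name under db_path to its .tsidx file count."""
    counts = {}
    for entry in os.listdir(db_path):
        bucket_dir = os.path.join(db_path, entry)
        # Assumption: hot buckets are directories whose names start with hot_prefix.
        if not (entry.startswith(hot_prefix) and os.path.isdir(bucket_dir)):
            continue
        counts[entry] = sum(
            1 for name in os.listdir(bucket_dir) if name.endswith(".tsidx")
        )
    return counts


def buckets_over(counts, limit):
    """Return the sorted names of buckets holding at least `limit` tsidx files."""
    return sorted(name for name, total in counts.items() if total >= limit)


if __name__ == "__main__":
    # Path of the default index as it appears in the warning message.
    db_path = "/opt/splunk/var/lib/splunk/defaultdb/db"
    if os.path.isdir(db_path):
        counts = count_tsidx_per_hot_bucket(db_path)
        print(counts)
        print("at or over limit:", buckets_over(counts, limit=100))
```

Run from cron every minute or so, this gives a trend line, so you can see whether the count climbs steadily or jumps just before the throttling message.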
To understand why splunk-optimize is unable to keep up, I would first recommend asking yourself these questions:
Impact on indexing:
Are these warnings actually associated with episodes of prolonged saturation of the event-processing queues and/or indexing latency? Or is it just that the error messages are alarming in their nature?
Evolution of the overall indexing throughput at the time of the errors:
Is it peaking? Dropping? Staying the same? The 'Indexing Performance' view of the S.o.S app can help you to find out.
Active hot bucket concurrency:
It is important to determine how many active hot buckets splunkd is concurrently writing to and maintaining during normal operation. The output of lsof_sos.sh (a scripted input of the S.o.S app) can help you there, as it will reference open file descriptors of files located in the hot buckets.
Global system resource usage pattern at the time of the errors, particularly I/O on
Maybe a concurrent activity such as search is competing for I/O bandwidth or CPU at the time of the incidents?
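As a rough illustration of the hot bucket concurrency check, the Python sketch below extracts the distinct hot bucket directories from lsof-style output. It assumes each line includes the full path of the open file (as plain lsof output does) and that hot bucket directory names start with hot_. The exact format of the lsof_sos.sh output may differ, so treat the pattern as a starting point.

```python
import re

# Captures a path up to and including a directory whose name starts with hot_.
# Heuristic: a plain file named hot_<something> at the end of a path also matches.
HOT_BUCKET_RE = re.compile(r'(/\S*?/hot_[^/\s]+)(?:/|\s|$)')


def active_hot_buckets(lines):
    """Return the set of distinct hot bucket directories referenced in the lines."""
    buckets = set()
    for line in lines:
        for match in HOT_BUCKET_RE.finditer(line):
            buckets.add(match.group(1))
    return buckets


if __name__ == "__main__":
    # Hypothetical lsof-style lines; the file names are made up for illustration.
    sample = [
        "splunkd 1234 splunk 45u REG 8,1 1024 99 "
        "/opt/splunk/var/lib/splunk/defaultdb/db/hot_v1_12/rawdata/journal.gz",
        "splunkd 1234 splunk 47u REG 8,1 2048 101 "
        "/opt/splunk/var/lib/splunk/defaultdb/db/hot_v1_13/Hosts.data",
    ]
    for bucket in sorted(active_hot_buckets(sample)):
        print(bucket)
```

The size of the returned set at a given moment approximates how many hot buckets splunkd has open at once.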
If seeking the answers to these questions doesn't reveal a likely cause (for example, a sporadic resource contention with parallel activities such as search), and if these warnings occur frequently and have an impact on your indexing performance, I would recommend generating a diag of the affected instance and opening a case with Splunk support.