Hello @hrawat_splunk,

Thanks a lot for your reply. We applied the suggested configuration to both server.conf and indexes.conf.

As I understand it, the aim is to check for index throttling more often (throttleCheckPeriod=5), with only one splunk-optimize process running against a given bucket (maxConcurrentOptimizes=1), while allowing many helper child processes overall (maxRunningProcessGroups=32) and checking every second whether another child process can be launched (processTrackerServiceInterval=0). In other words, all of the optimize resources are concentrated on a single bucket at a time. (A sketch of the stanzas as we applied them is at the bottom of this post, after the spec excerpt.)

What we observed is the following: if we put the Cluster Master in maintenance mode and stop an Indexer, we no longer see the messages in the Monitoring Console. The indexing queue still fills to 100% on all remaining Indexers, even though we increased its size to 500MB, and the indexing rate drops from 15-20 MB/s to about 1 MB/s, but the problem resolves itself in a short time (approximately a few minutes).

The side effect we observed is that, since we applied the change, we see the following messages in the Indexers' _internal index (20-30 events per hour):

02-16-2023 17:15:33.556 +0100 INFO HealthChangeReporter - feature="Index Optimization" indicator="concurrent_optimize_processes_percent" previous_color=green color=yellow due_to_threshold_value=100 measured_value=1 reason="The number of splunk optimize processes is at 100% of the maximum. As a result, the index processor has paused data flow."

02-16-2023 17:15:47.753 +0100 INFO PeriodicHealthReporter - feature="Index Optimization" color=yellow indicator="concurrent_optimize_processes_percent" due_to_threshold_value=100 measured_value=1 reason="The number of splunk optimize processes is at 100% of the maximum. As a result, the index processor has paused data flow." node_type=indicator node_path=splunkd.index_processor.index_optimization.concurrent_optimize_processes_percent

We also still see the following "original" message (1-2 events per day):

02-16-2023 14:45:38.658 +0100 INFO IndexWriter [12974 indexerPipe] - The index processor has paused data flow. Too many tsidx files in idx=_internal bucket="/xxxxxxx/xxxx/xxxxxxxxxx/splunk/db/_internaldb/db/hot_v1_1928" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.

It seems to me the direction you gave us is the correct one to solve the problem: it is now fine that the cluster takes just a few minutes to recover, whereas before it took much longer. What we would like to improve is to avoid, during normal running, the PeriodicHealthReporter and HealthChangeReporter messages telling us that indexing has paused (a search for keeping an eye on these is also sketched at the very end of this post). Do you think we can increase the maxConcurrentOptimizes value to avoid that? That way we would spread the "brute force" across more buckets; we would probably lose something when an Indexer is stopped, but we would gain during normal running.

For reference, here is the relevant indexes.conf specification:

throttleCheckPeriod = <positive integer>
* How frequently, in seconds, that splunkd checks for index throttling
conditions.
* NOTE: Do not change this setting unless a Splunk Support
professional asks you to.
* The highest legal value is 4294967295.
* Default: 15

maxConcurrentOptimizes = <nonnegative integer>
* The number of concurrent optimize processes that can run against a hot
bucket.
* This number should be increased if:
* There are always many small tsidx files in the hot bucket.
* After rolling, there are many tsidx files in warm or cold buckets.
* You must restart splunkd after changing this setting. Reloading the
configuration does not suffice.
* The highest legal value is 4294967295.
* Default: 6

maxRunningProcessGroups = <positive integer>
* splunkd runs helper child processes like "splunk-optimize",
"recover-metadata", etc. This setting limits how many child processes
can run at any given time.
* This maximum applies to all of splunkd, not per index. If you have N
indexes, there will be at most 'maxRunningProcessGroups' child processes,
not N * 'maxRunningProcessGroups' processes.
* Must maintain maxRunningProcessGroupsLowPriority < maxRunningProcessGroups
* This is an advanced setting; do NOT set unless instructed by Splunk
Support.
* Highest legal value is 4294967295.
* Default: 8

processTrackerServiceInterval = <nonnegative integer>
* How often, in seconds, the indexer checks the status of the child OS
processes it has launched to see if it can launch new processes for queued
requests.
* If set to 0, the indexer checks child process status every second.
* Highest legal value is 4294967295.
* Default: 15

Thanks a lot,
Edoardo
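P.S. For completeness, here is a sketch of the settings roughly as we applied them. The values are the ones discussed above; the stanza placement ([default] in indexes.conf, and [queue=indexQueue] in server.conf for the 500MB indexing queue) is written from memory, so double-check it against your actual files:

# indexes.conf on the indexer cluster peers
# (assumption: settings applied globally under [default]; maxConcurrentOptimizes can also be set per index)
[default]
throttleCheckPeriod = 5
maxConcurrentOptimizes = 1
maxRunningProcessGroups = 32
processTrackerServiceInterval = 0

# server.conf - assumption: the stanza used to raise the indexing queue to 500MB
[queue=indexQueue]
maxSize = 500MB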
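And for anyone who wants to keep an eye on how often this health indicator flips, a simple search along these lines should work (the component, feature and indicator values come straight from the events pasted above; adjust the span as needed):

index=_internal sourcetype=splunkd (component=HealthChangeReporter OR component=PeriodicHealthReporter) feature="Index Optimization" indicator=concurrent_optimize_processes_percent
| timechart span=1h count by host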