This issue happens when incoming throughput for hot buckets is faster than splunk-optimize can merge tsidx files and keep the count below 100 (hardcoded). If the number of tsidx files per hot bucket is >= 100, the indexer applies an indexing pause to allow splunk-optimize to catch up.
From 7.2 onwards, the following config should fix the issue.
In indexes.conf, set:
[default]
maxRunningProcessGroups=12
processTrackerServiceInterval=0
Update (11/16/2022): If the issue is still not resolved, increase the maxRunningProcessGroups setting.
For the upcoming Splunk 9.1 release and Splunk Cloud releases, the workaround is not needed because the issue is fixed there.
The fix will be in the next major release, 9.1.
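To see how close each hot bucket is to the 100-file limit described above, a small script can count the tsidx files on disk. This is only a sketch: it assumes the default layout `$SPLUNK_DB/<index dir>/db/hot_v1_<id>` (as in the bucket paths quoted in this thread), and the function names and the `warn_at=80` threshold are hypothetical, not anything Splunk ships.

```python
import os
from pathlib import Path

# Per-hot-bucket limit quoted in this thread (hardcoded in splunkd).
TSIDX_LIMIT = 100


def count_tsidx_per_hot_bucket(splunk_db):
    """Return {hot bucket path: number of .tsidx files directly inside it}.

    Assumes the default layout <splunk_db>/<index dir>/db/hot_v1_<id>;
    indexes with a custom homePath are not found by this sketch.
    """
    root = Path(splunk_db)
    counts = {}
    if not root.is_dir():
        return counts
    for bucket in root.glob("*/db/hot_v1_*"):
        if bucket.is_dir():
            counts[str(bucket)] = sum(
                1 for f in bucket.iterdir() if f.is_file() and f.suffix == ".tsidx"
            )
    return counts


def buckets_near_limit(counts, warn_at=80):
    """Return (path, count) pairs at or above warn_at, highest count first.

    warn_at is an arbitrary illustrative threshold below TSIDX_LIMIT.
    """
    hits = [(path, n) for path, n in counts.items() if n >= warn_at]
    return sorted(hits, key=lambda item: -item[1])


if __name__ == "__main__":
    # SPLUNK_DB usually points at $SPLUNK_HOME/var/lib/splunk.
    db = os.environ.get("SPLUNK_DB", "/opt/splunk/var/lib/splunk")
    for path, n in buckets_near_limit(count_tsidx_per_hot_bucket(db)):
        print(f"{n:4d} tsidx files (limit {TSIDX_LIMIT}): {path}")
```

Run it on an indexer while ingestion is busy. A bucket that stays near the limit suggests splunk-optimize is not keeping up for that index.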
There are multiple reasons for indexing pause.
Do you see this on all indexers all the time?
Do you see this on a few indexers at a time, moving from one to another?
Do you see this issue only when a few indexers are restarted?
@hrawat thanks for the update.
We have the exact same issue. I see you mentioned it has been fixed with 9.1. Do you mean 9.0.1? The latest available is 9.0.3.
We are on-prem with 9.0.2 and still facing it, even though we already applied the indicated setup in indexes.conf (see my question here).
Do you suggest increasing maxRunningProcessGroups?
OK, you mentioned that in your other post.
We are still facing the following issue when we put our indexer cluster in maintenance mode and stop one indexer.
Basically, all the indexers stop ingesting data and their queues grow while they wait for splunk-optimize to finish the job.
This usually happens when we stop the indexer a long time after the previous stop.
Do you suggest increasing maxRunningProcessGroups?
Stopping one or a few indexers causes the index queue to block across several indexers. At the same time you see
throttled: The index processor has paused data flow. Too many tsidx files
across several indexers.
If this is the case, and it takes a long time for the index queue to unblock and the indexing throttle to go away, try the following workaround to reduce the outage.
In server.conf
[queue=indexQueue]
maxSize=500MB
In indexes.conf
[default]
throttleCheckPeriod=5
maxConcurrentOptimizes=1
maxRunningProcessGroups=32
processTrackerServiceInterval=0
Hello @hrawat
Thanks a lot for your reply.
We applied the suggested configuration in both server.conf and indexes.conf.
So, from my understanding, the aim is to check for index throttling more often (throttleCheckPeriod=5), with only 1 splunk-optimize running over a bucket (maxConcurrentOptimizes=1), spawning several child processes over it (maxRunningProcessGroups=32), and checking every second whether any other child process can be launched (processTrackerServiceInterval=0). The purpose is therefore to concentrate all the optimize resources on a single bucket at a time.
What we observed is the following:
02-16-2023 17:15:33.556 +0100 INFO HealthChangeReporter - feature="Index Optimization" indicator="concurrent_optimize_processes_percent" previous_color=green color=yellow due_to_threshold_value=100 measured_value=1 reason="The number of splunk optimize processes is at 100% of the maximum. As a result, the index processor has paused data flow."
02-16-2023 17:15:47.753 +0100 INFO PeriodicHealthReporter - feature="Index Optimization" color=yellow indicator="concurrent_optimize_processes_percent" due_to_threshold_value=100 measured_value=1 reason="The number of splunk optimize processes is at 100% of the maximum. As a result, the index processor has paused data flow." node_type=indicator node_path=splunkd.index_processor.index_optimization.concurrent_optimize_processes_percent
02-16-2023 14:45:38.658 +0100 INFO IndexWriter [12974 indexerPipe] - The index processor has paused data flow. Too many tsidx files in idx=_internal bucket="/xxxxxxx/xxxx/xxxxxxxxxx/splunk/db/_internaldb/db/hot_v1_1928" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.
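To see which indexes and buckets trigger the pause most often, the IndexWriter messages above can be tallied from splunkd.log. The sketch below assumes the message format stays exactly as quoted in this thread. The regular expression and function names are illustrative, not part of any Splunk tooling.

```python
import re
from collections import Counter

# Matches the IndexWriter throttle message as quoted in this thread.
# If the wording differs in your Splunk version, adjust the pattern.
PAUSE_RE = re.compile(
    r"The index processor has paused data flow\. "
    r'Too many tsidx files in idx=(?P<idx>\S+) bucket="(?P<bucket>[^"]+)"'
)


def parse_pause_events(lines):
    """Return a list of (index name, bucket path) for every throttle message."""
    events = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            events.append((match.group("idx"), match.group("bucket")))
    return events


def pauses_per_index(lines):
    """Count throttle messages per index name."""
    return Counter(idx for idx, _bucket in parse_pause_events(lines))
```

For example, passing an open splunkd.log file handle to `pauses_per_index` gives a per-index count. The health reporter lines are ignored because they do not name an index or bucket.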
It seems to me the direction you gave us is the right one to solve the problem. In fact, the cluster now takes just a few minutes to recover, whereas before it took much longer. What we would like to improve is to avoid, during normal running, the PeriodicHealthReporter and HealthChangeReporter messages that inform us that indexing has stopped.
Do you think that we can increase the maxConcurrentOptimizes value to avoid that?
I think in this way we could better balance the "brute force" over more buckets. We will probably lose something when an indexer is stopped, but we gain in normal running.
For reference, here is the indexes.conf specification:
throttleCheckPeriod = <positive integer>
* How frequently, in seconds, that splunkd checks for index throttling
conditions.
* NOTE: Do not change this setting unless a Splunk Support
professional asks you to.
* The highest legal value is 4294967295.
* Default: 15
maxConcurrentOptimizes = <nonnegative integer>
* The number of concurrent optimize processes that can run against a hot
bucket.
* This number should be increased if:
* There are always many small tsidx files in the hot bucket.
* After rolling, there are many tsidx files in warm or cold buckets.
* You must restart splunkd after changing this setting. Reloading the
configuration does not suffice.
* The highest legal value is 4294967295.
* Default: 6
maxRunningProcessGroups = <positive integer>
* splunkd runs helper child processes like "splunk-optimize",
"recover-metadata", etc. This setting limits how many child processes
can run at any given time.
* This maximum applies to all of splunkd, not per index. If you have N
indexes, there will be at most 'maxRunningProcessGroups' child processes,
not N * 'maxRunningProcessGroups' processes.
* Must maintain maxRunningProcessGroupsLowPriority < maxRunningProcessGroups
* This is an advanced setting; do NOT set unless instructed by Splunk
Support.
* Highest legal value is 4294967295.
* Default: 8
processTrackerServiceInterval = <nonnegative integer>
* How often, in seconds, the indexer checks the status of the child OS
processes it has launched to see if it can launch new processes for queued
requests.
* If set to 0, the indexer checks child process status every second.
* Highest legal value is 4294967295.
* Default: 15
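As a quick sanity check of a [default] stanza against the specification quoted above, the following sketch lists which of the four settings are overridden and whether each value is within the documented range. The defaults and limits are copied from the spec text in this post. The parser is deliberately simplistic and hypothetical: it reads one file's text and does not reproduce Splunk's configuration layering, for which `splunk btool indexes list --debug` is the authoritative view.

```python
# name: (default, minimum allowed value), taken from the spec quoted above.
SPEC = {
    "throttleCheckPeriod": (15, 1),            # <positive integer>
    "maxConcurrentOptimizes": (6, 0),          # <nonnegative integer>
    "maxRunningProcessGroups": (8, 1),         # <positive integer>
    "processTrackerServiceInterval": (15, 0),  # <nonnegative integer>
}
MAX_VALUE = 4294967295  # highest legal value for all four settings


def read_stanza(conf_text, stanza="default"):
    """Collect key/value pairs from one stanza of a .conf file's text."""
    settings = {}
    current = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        if current == stanza and "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def review(settings):
    """Return one message per overridden or out-of-range setting."""
    report = []
    for name, (default, minimum) in SPEC.items():
        if name not in settings:
            continue
        raw = settings[name]
        if not raw.isdigit() or not (minimum <= int(raw) <= MAX_VALUE):
            report.append(f"{name}={raw}: outside the range given in the spec")
        elif int(raw) != default:
            report.append(f"{name}={raw} (default {default})")
    return report
```

Calling `review(read_stanza(text))` on the contents of an indexes.conf returns a list of short messages, one per setting that differs from its default.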
Thanks a lot,
Edoardo
Hi @hrawat
We ended up with this configuration:
In server.conf
[queue=indexQueue]
maxSize=500MB
In indexes.conf
[default]
throttleCheckPeriod=5
maxConcurrentOptimizes=2
maxRunningProcessGroups=32
processTrackerServiceInterval=0
In this way we get both benefits.
Thanks a lot for your suggestion!
I have a new deployment of Splunk 9.2.1 Enterprise. We only have the Splunk servers running so far, other than one Universal Forwarder. I'm getting this error:
The index processor has paused data flow. Too many tsidx files in idx=_internal bucket="/opt/splunk/var/lib/splunk/_internaldb/db/hot_v1_57" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.
I have 4TB of available disk space, so I have no idea what's going on. Any thoughts?
Or one of the log files under var/log/splunk is flooding.
If you have only one UF and a few SHs and the internal index is still pausing, it's likely that the system is running out of CPU due to high load/search activity, or that there is some I/O performance issue.
The log message is a bit generic.
The reason for this message is that too many internal index log events arrived on that indexer, and as a result there are already 100+ tsidx files for the hot bucket in question. Until splunk-optimize brings the count below 100, the indexer will remain paused.
On the forwarder side, make sure that not too many events hit the same indexer.
1. On SH/CM/UF, you can enable volume-based forwarding.
2. On all instances (SH/CM/UF/IDX), reduce unwanted metrics.log events.