Getting Data In

Why has the index process paused data flow? How to handle too many tsidx files?

hrawat
Splunk Employee

This issue happens when incoming throughput for hot buckets is faster than splunk-optimize can merge tsidx files and keep the count below 100 (hardcoded). If the number of tsidx files in a hot bucket reaches 100 or more, the indexer pauses indexing to let splunk-optimize catch up.
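To see how close a hot bucket is to that limit, the tsidx files can be counted on disk. The following is only a minimal sketch: it assumes hot bucket directories are named hot_* (as in the hot_v1_... paths in the log messages in this thread) and that the index files end in .tsidx. The helper names are hypothetical and are not part of Splunk.

```python
from pathlib import Path

# Hardcoded limit described in the post above.
TSIDX_PAUSE_THRESHOLD = 100


def count_tsidx_per_hot_bucket(index_db_dir):
    """Return {bucket_name: tsidx_file_count} for hot buckets under index_db_dir.

    Assumption: hot buckets are directories named hot_* and their
    index files carry the .tsidx extension.
    """
    counts = {}
    for bucket in sorted(Path(index_db_dir).glob("hot_*")):
        if bucket.is_dir():
            counts[bucket.name] = sum(1 for _ in bucket.glob("*.tsidx"))
    return counts


def buckets_at_risk(counts, threshold=TSIDX_PAUSE_THRESHOLD):
    """Return the bucket names whose tsidx count has reached the threshold."""
    return [name for name, n in counts.items() if n >= threshold]
```

For example, count_tsidx_per_hot_bucket("/opt/splunk/var/lib/splunk/_internaldb/db") would inspect the _internal index on a default install path like the one quoted later in this thread.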

 

0 Karma
1 Solution

hrawat
Splunk Employee

From version 7.2 onward, the following configuration should fix the issue.

In indexes.conf, set:

[default]
maxRunningProcessGroups=12
processTrackerServiceInterval=0

Update (11/16/2022): If the issue is still not resolved, increase the maxRunningProcessGroups setting.
For the upcoming Splunk 9.1 release and Splunk Cloud releases, the workaround is not needed because the issue has been fixed.


hrawat
Splunk Employee

The fix will be in the next major release, 9.1.
There are multiple reasons for an indexing pause.

Do you see this on all indexers all the time?

Do you see this on a few indexers at a time, with the affected set moving around?
Do you see this issue only when a few indexers are restarted?


0 Karma


edoardo_vicendo
Builder

@hrawat thanks for the update.

We have exactly the same issue. You mentioned it has been fixed in 9.1; do you mean 9.0.1? The latest available version is 9.0.3.

We are on-prem on 9.0.2 and still face it even though we have already applied the indicated settings in indexes.conf. See my question here:

https://community.splunk.com/t5/Splunk-Enterprise/The-index-processor-has-paused-data-flow-How-to-op...

Do you suggest increasing maxRunningProcessGroups?

 

0 Karma

hrawat
Splunk Employee

OK, you mentioned that in your other post:

We are still facing the following issue when we put our Indexer Cluster in maintenance mode and stop one Indexer.

Basically, all the Indexers stop ingesting data and their queues grow while they wait for splunk-optimize to finish the job.

This usually happens when a long time has passed since we last stopped an Indexer.




Stopping one or a few indexers causes the indexqueue to block across several indexers. At the same time, you see

throttled: The index processor has paused data flow. Too many tsidx files

across several indexers.

If this is the case, and it takes a long time for the indexqueue to unblock and the indexing throttle to go away, try the following workaround to reduce the outage.

In server.conf
[queue=indexQueue]
maxSize=500MB

In indexes.conf
[default]
throttleCheckPeriod=5
maxConcurrentOptimizes=1
maxRunningProcessGroups=32 
processTrackerServiceInterval=0

 

edoardo_vicendo
Builder

Hello @hrawat

Thank you very much for your reply.

We applied the suggested configuration in both server.conf and indexes.conf.

So, from my understanding, the aim is to check for index throttling more often (throttleCheckPeriod=5), with only one splunk-optimize process running against a bucket (maxConcurrentOptimizes=1), with several child processes spawned for it (maxRunningProcessGroups=32), and with a check every second for whether another child process can be launched (processTrackerServiceInterval=0). The purpose is therefore to concentrate all the optimize resources on a single bucket at a time.

What we observed is the following:

  • If we put the Cluster Master in maintenance mode and stop an Indexer, we no longer see the messages in the Monitoring Console. The indexing queue still fills to 100% on all remaining Indexers even though we increased its size to 500MB, and the indexing rate drops from 15-20 MB/s to 1 MB/s, but the problem resolves within a few minutes.
  • The side effect is that, since we applied the change, we see the following messages in the Indexers' _internal index (20-30 events per hour):
02-16-2023 17:15:33.556 +0100 INFO  HealthChangeReporter - feature="Index Optimization" indicator="concurrent_optimize_processes_percent" previous_color=green color=yellow due_to_threshold_value=100 measured_value=1 reason="The number of splunk optimize processes is at 100% of the maximum. As a result, the index processor has paused data flow."

02-16-2023 17:15:47.753 +0100 INFO  PeriodicHealthReporter - feature="Index Optimization" color=yellow indicator="concurrent_optimize_processes_percent" due_to_threshold_value=100 measured_value=1 reason="The number of splunk optimize processes is at 100% of the maximum. As a result, the index processor has paused data flow." node_type=indicator node_path=splunkd.index_processor.index_optimization.concurrent_optimize_processes_percent
  • We also still see the "original" message (1-2 events per day):
02-16-2023 14:45:38.658 +0100 INFO  IndexWriter [12974 indexerPipe] - The index processor has paused data flow. Too many tsidx files in idx=_internal bucket="/xxxxxxx/xxxx/xxxxxxxxxx/splunk/db/_internaldb/db/hot_v1_1928" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.
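To track how often each bucket triggers the pause, the IndexWriter messages can be tallied from exported log lines. This is a rough sketch based only on the message format quoted above. The function name count_pauses is hypothetical, and the regular expression is an assumption that may need adjusting if the message wording differs between versions.

```python
import re
from collections import Counter

# Matches the IndexWriter "Too many tsidx files" message quoted above and
# captures the index name and bucket path. The health reporter messages
# use different wording and are deliberately not matched.
PAUSE_RE = re.compile(
    r"The index processor has paused data flow\. "
    r'Too many tsidx files in idx=(?P<idx>\S+) bucket="(?P<bucket>[^"]+)"'
)


def count_pauses(lines):
    """Return a Counter keyed by (index, bucket_path) for pause messages."""
    counts = Counter()
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            counts[(match.group("idx"), match.group("bucket"))] += 1
    return counts
```

Feeding it the lines of an exported splunkd.log (for example, via open(path) as an iterable) gives a per-bucket count that can be compared before and after a configuration change.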

 

It seems to me that the direction you gave us is the correct one: the Cluster now takes just a few minutes to recover, whereas before it took much longer. What we would like to avoid during normal operation are the PeriodicHealthReporter and HealthChangeReporter messages that tell us indexing has paused.

Do you think that we can increase the maxConcurrentOptimizes value to avoid that?

I think this way we could better spread the "brute force" over more buckets. We will probably lose something when an Indexer is stopped, but we gain during normal operation.

 

For reference, here is the indexes.conf specification:

throttleCheckPeriod = <positive integer>
* How frequently, in seconds, that splunkd checks for index throttling
  conditions.
* NOTE: Do not change this setting unless a Splunk Support
  professional asks you to.
* The highest legal value is 4294967295.
* Default: 15

maxConcurrentOptimizes = <nonnegative integer>
* The number of concurrent optimize processes that can run against a hot
  bucket.
* This number should be increased if:
  * There are always many small tsidx files in the hot bucket.
  * After rolling, there are many tsidx files in warm or cold buckets.
* You must restart splunkd after changing this setting. Reloading the
  configuration does not suffice.
* The highest legal value is 4294967295.
* Default: 6

maxRunningProcessGroups = <positive integer>
* splunkd runs helper child processes like "splunk-optimize",
  "recover-metadata", etc. This setting limits how many child processes
  can run at any given time.
* This maximum applies to all of splunkd, not per index. If you have N
  indexes, there will be at most 'maxRunningProcessGroups' child processes,
  not N * 'maxRunningProcessGroups' processes.
* Must maintain maxRunningProcessGroupsLowPriority < maxRunningProcessGroups
* This is an advanced setting; do NOT set unless instructed by Splunk
  Support.
* Highest legal value is 4294967295.
* Default: 8

processTrackerServiceInterval = <nonnegative integer>
* How often, in seconds, the indexer checks the status of the child OS
  processes it has launched to see if it can launch new processes for queued
  requests.
* If set to 0, the indexer checks child process status every second.
* Highest legal value is 4294967295.
* Default: 15
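Before restarting splunkd with tuned values, the settings can be sanity-checked against the ranges in the specification excerpt above. The sketch below is illustrative only: check_settings is a hypothetical helper, and Python's configparser is used as a simplified stand-in for Splunk's own .conf parsing, which it does not fully replicate.

```python
import configparser

UINT32_MAX = 4294967295  # "highest legal value" in the spec excerpt

# Setting name -> minimum allowed value, taken from the excerpt above
# (<positive integer> starts at 1, <nonnegative integer> at 0).
MIN_VALUES = {
    "throttleCheckPeriod": 1,
    "maxConcurrentOptimizes": 0,
    "maxRunningProcessGroups": 1,
    "processTrackerServiceInterval": 0,
}


def check_settings(conf_text, stanza="default"):
    """Return a list of problems found in the given stanza of conf_text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the camelCase setting names as written
    parser.read_string(conf_text)
    problems = []
    for name, minimum in MIN_VALUES.items():
        if not parser.has_option(stanza, name):
            continue  # setting not present, so the default applies
        raw = parser.get(stanza, name).strip()
        if not (raw.isascii() and raw.isdigit()):
            problems.append(f"{name}: '{raw}' is not a nonnegative integer")
            continue
        value = int(raw)
        if not minimum <= value <= UINT32_MAX:
            problems.append(
                f"{name}: {value} is outside {minimum}..{UINT32_MAX}"
            )
    return problems
```

An empty list means the values fall within the documented ranges. It says nothing about whether they suit a given workload.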

 

Thanks a lot,

Edoardo

0 Karma

edoardo_vicendo
Builder

Hi @hrawat 

We ended up with this configuration:

In server.conf
[queue=indexQueue]
maxSize=500MB

In indexes.conf
[default]
throttleCheckPeriod=5
maxConcurrentOptimizes=2
maxRunningProcessGroups=32 
processTrackerServiceInterval=0

 

This way we get both benefits:

  • If we restart the cluster, we no longer see the IndexWriter message.
  • During normal operation, we no longer see the HealthChangeReporter or PeriodicHealthReporter messages.

Thanks a lot for your suggestion!

0 Karma

mommyfixit
Loves-to-Learn

I have a new deployment of Splunk 9.2.1 Enterprise. We only have the Splunk servers running so far, other than one Universal Forwarder. I'm getting this error:

The index processor has paused data flow. Too many tsidx files in idx=_internal bucket="/opt/splunk/var/lib/splunk/_internaldb/db/hot_v1_57" , waiting for the splunk-optimize indexing helper to catch up merging them. Ensure reasonable disk space is available, and that I/O write throughput is not compromised.

I have 4TB of available disk space, so I have no idea what's going on. Any thoughts?

0 Karma

hrawat
Splunk Employee

Or one of the log files under var/log/splunk is flooding.

0 Karma

hrawat
Splunk Employee

If you have only one UF and a few SHs and the internal index is still pausing, it's likely that the system is running out of CPU due to high load or search activity, or that there is an I/O performance issue.

0 Karma

hrawat
Splunk Employee

The log message is a bit generic.
The reason for this message is that too many internal index log events arrived on that indexer, and as a result there are already 100+ tsidx files in the hot bucket in question. Until splunk-optimize brings the count below 100, the indexer will remain paused.

On the forwarder side, make sure that not too many events hit the same indexer.
1. On SH/CM/UF, you can enable volume-based forwarding.

2. On all instances (SH/CM/UF/IDX), reduce unwanted metrics.log events.

0 Karma