Getting Data In

Indexer stops sending back ACK

verbal_666
Builder

Hi.

During the day, some of my Indexers completely stop sending back the ACK, so many agents keep data queued until the ACK arrives and the flow restarts (in some cases 15/20 minutes pass!!!). Meanwhile, obviously, I get many data delays and ACK errors.

This happens at certain hours, from 09:00 to 17:00. During very high data ingestion the issue is clearly visible; during the other hours it is transparent, no issue (little data flowing and little user interaction).

I'm wondering: could it be an Indexer internal task that manages indexes/buckets, optimizes the system and handles retention? If so, is this task "editable" so it runs "once per day only" (in night hours)?
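For context, the queuing behaviour described above is driven by indexer acknowledgement on the forwarders; a minimal outputs.conf sketch of the settings involved (the group name, server names and queue size here are placeholders, not my real values):

# outputs.conf on the forwarders (illustrative only)
[tcpout:primary_indexers]
server = idx01.example.local:9997, idx02.example.local:9997
useACK = true        # forwarder keeps events in its wait queue until the indexer ACKs them
maxQueueSize = 7MB   # with useACK enabled the wait queue can grow to about 3x this size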

Thanks.

livehybrid
SplunkTrust

Hi @verbal_666 

This does sound like a resource availability issue.

Please can you check the Monitoring Console https://<yourSplunkEnv>:<port>/en-US/app/splunk_monitoring_console/indexing_performance_instance

This should highlight any blockage in the pipeline. In the meantime, could you also confirm the number of parallelIngestionPipelines set in server.conf? I'd suggest using btool for this:

$SPLUNK_HOME/bin/splunk cmd btool server list --debug general

What value do you have for parallelIngestionPipelines?
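For reference, the setting lives in the [general] stanza of server.conf; a minimal sketch of what btool might show (the value below is just the default, not a recommendation):

# server.conf on an indexer (illustrative)
[general]
parallelIngestionPipelines = 1    # number of independent ingestion pipeline sets; default is 1

The --debug flag will also show which configuration file each value comes from.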

🌟 Did this answer help you? If so, please consider:

  • Adding karma to show it was useful
  • Marking it as the solution if it resolved your issue
  • Commenting if you need any clarification

Your feedback encourages the volunteers in this community to continue contributing


verbal_666
Builder

The issue is definitely that I have to add some Indexers, and maybe also 1 or 2 SHs, to the cluster.

The infrastructure is currently undersized; it can't handle the current data volume and jobs.

Due to very high data bursts during office hours (9 to 17), delays (for very massive log files) and CPU saturation on the indexer side, the infrastructure can't handle all the data, user interaction and scheduled jobs at once. So the Indexers stop responding at times.

Pipelines is 1; if I raise it to 2 the system collapses. The Monitoring Console showed some heavy queries in that time range that also write directly to some indexes. I also have my own monitoring dashboard on the SHs that shows a strong delay for heavy logs (from 15 to 90 minutes before they get back to 0 minutes of delay and the indexers can drain the queues), some blocked queues (I have a 1000MB size set for many queues), and all of that quite clearly points to a collapsing infrastructure 🤷‍♂️

The infrastructure has grown over the last months, so it's time to add some servers. I began with 2 Indexers, then 4; now I really have to go to 8/12. Splunk best practices also suggest a 12-Indexer infrastructure for my current data flow (2-3 TB per day).

Meanwhile, I mitigated the current situation by disabling heavy logs and heavy jobs on the SHs 🤷‍♂️ I also lowered the throughput for the UFs, from maximum to 10MB/s. The system works, but only by disabling some features and data.
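For anyone applying the same workaround, the UF throughput cap goes in limits.conf; a minimal sketch, assuming 10MB/s is roughly 10240 KB/s (the value is simply what I used, adjust for your case):

# limits.conf on the Universal Forwarders (illustrative)
[thruput]
maxKBps = 10240    # cap forwarder output at ~10 MB/s; 0 means unlimited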

Thanks all.


thahir
Communicator

Hi @verbal_666 ,

if the indexer resource usage is stable and this happens periodically, it may indicate a network issue.

Try to capture a pcap during the delay window and check for dropped ACKs, then engage the network or firewall team to analyze the traffic and session timeouts; they could be affecting Splunk traffic.
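For example, something like the capture below, assuming the default Splunk-to-Splunk port 9997 and an interface named eth0 (adjust both for your environment):

# run on an affected indexer during the delay window (illustrative)
tcpdump -i eth0 -w /tmp/splunk_s2s_delay.pcap 'tcp port 9997'
# then look for retransmissions, resets or long idle gaps around the time the ACKs stop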


PrewinThomas
Motivator

@verbal_666 

Splunk doesn’t offer a built-in scheduler for bucket management tasks like rolling or retention.
I would say focus on resource monitoring, and possibly scaling your indexer infrastructure, not on manipulating Splunk's internal maintenance timing.

That said, you could consider the possible tuning below, although it's not a recommended approach (a sketch of the relevant stanzas follows the list).

-Tune max_peer_rep_load and max_peer_build_load in server.conf: reduce these values to throttle replication and bucket-building load

-Adjust forwarder behavior by editing autoLBFrequency: a higher value reduces how often forwarders switch indexers, lowering the channel creation rate
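A minimal sketch of where those settings live, assuming an indexer cluster with forwarders using load-balanced outputs (the values shown are the usual defaults, not recommendations, and the output group name is a placeholder):

# server.conf on the cluster manager (illustrative)
[clustering]
max_peer_rep_load = 5      # max concurrent replications a peer can be the target of
max_peer_build_load = 2    # max concurrent bucket-build (make searchable) tasks per peer

# outputs.conf on the forwarders (illustrative)
[tcpout:primary_indexers]
autoLBFrequency = 30       # seconds between load-balanced switches; raising it means fewer switches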


#https://community.splunk.com/t5/Getting-Data-In/Why-did-ingestion-slow-way-down-after-I-added-thousa...


Regards,
Prewin
Splunk Enthusiast | Always happy to help! If this answer helped you, please consider marking it as the solution or giving a Karma. Thanks!

verbal_666
Builder

The strange thing is that resource usage is roughly the same the whole time from 09 to 17, with some "normal" CPU peaks (I have to add some Indexers asap), and the same number and quality of searches (none of them seems to create any loop or resource bottleneck!!!).

I was also wondering if some network device does a "refresh" (every hour), maybe breaking the Indexers' ACK responses 🤷‍♂️🤷‍♂️🤷‍♂️ quite strange...


PrewinThomas
Motivator

I agree with you on that, if your CPU, IOPS, and searches all seem steady.

Some network appliances have a default TCP session timeout. If forwarder/indexer sessions sit idle or ACKs are delayed just long enough, the connection may be dropped, forcing re-establishment and buffering.

Network switches/routers might also prune idle TCP flows, which affects forwarders that don't send constantly.
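If that turns out to be the cause, one thing you could try on the Splunk side is making sure the forwarder connections never look idle to the appliance; a minimal outputs.conf sketch, assuming the default values shown and an output group name that is only a placeholder:

# outputs.conf on the forwarders (illustrative, not a recommendation)
[tcpout:primary_indexers]
heartbeatFrequency = 30    # seconds between heartbeats to the indexer; keep this below the appliance idle timeout
connectionTimeout = 20     # seconds to wait when opening a connection before retrying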


Regards,
Prewin
Splunk Enthusiast | Always happy to help! If this answer helped you, please consider marking it as the solution or giving a Karma. Thanks!
