Getting Data In

Purge queue on forwarder / indexer down

Communicator

I had one of my indexers go down a couple of weeks back. Since then, each of my forwarders has been trying to send events to that indexer and failing with errors like:

WARN  TcpOutputFd - Connect to 10.1.4.183:9998 failed. Connection refused

So I modified my outputs.conf to remove that target indexer and restarted the (heavy) forwarder. However, I'm still seeing that error, along with queueing errors on the forwarder:

INFO  TailingProcessor - Could not send data to output queue (parsingQueue), retrying...

I suspect the output queue has retained the old indexer and is continuing to attempt delivery to it. As noted, cycling splunkd on the forwarder doesn't make a difference. I also think this is delaying events to my other indexers (5-15 minutes can go by before any events show up).

How can I alleviate this problem (aside from standing up an indexer on the failed IP noted above)?
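For context, my trimmed outputs.conf looks roughly like this (the group name and the surviving indexer IPs here are illustrative placeholders, not my real ones):

```ini
# Illustrative sketch only -- group name and IPs are placeholders
[tcpout]
defaultGroup = my_indexers

[tcpout:my_indexers]
# 10.1.4.183:9998 (the failed indexer) has been removed from this list
server = 10.1.4.181:9998, 10.1.4.182:9998
```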


Re: Purge queue on forwarder / indexer down

Ultra Champion

Did you find a solution for this? I think I'm seeing the same problem.


Re: Purge queue on forwarder / indexer down

Path Finder

Any solution or comment on this? We're in the same situation.


Re: Purge queue on forwarder / indexer down

Splunk Employee

Voting the question up is one way of saying you think this is important.


Re: Purge queue on forwarder / indexer down

Ultra Champion

We found a couple of things causing issues like this; they're not necessarily the same issue you're seeing.

I did some math and realized we had blocking because our Universal Forwarder was hitting the default thruput limit in limits.conf:

[thruput]
maxKBps = 256

So we changed that to 0, which makes it unlimited. Keep in mind this can increase CPU usage on the host where the forwarder runs.
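Concretely, the stanza we ended up with (in a local limits.conf on the forwarder) looks like:

```ini
[thruput]
# 0 = no throughput cap; monitor CPU on the forwarder host after this change
maxKBps = 0
```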

This allowed the forwarder to catch up on its backlog. I was then able to analyze metrics.log on the forwarder to see what thruput was actually required (the alternative is to do some math up front and estimate how much thruput you need).
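One way to read the required thruput out of metrics.log is a search over the internal index (a sketch, assuming your forwarder's internal logs reach the indexers; the field names are the standard metrics.log thruput fields, and the host value is a placeholder):

```
index=_internal source=*metrics.log* host=<your_forwarder> group=thruput name=thruput
| timechart avg(instantaneous_kbps)
```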
The other change was disabling useACK in the forwarder's outputs.conf, so it's:

[tcpout:mygroup]
useACK = false

This was because the indexer acknowledgments added traffic of their own and introduced pauses.

So in conclusion, check metrics.log and take a hard look at where the pipeline is backing up.
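A quick sketch of such a check, using the queue metrics forwarders write to metrics.log (again assuming internal logs reach your indexers):

```
index=_internal source=*metrics.log* group=queue blocked=true
| stats count by host, name
```

Any queue name that shows up here repeatedly (parsingqueue, tcpout, etc.) points at where the pipeline is stalling.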

Hopefully that helps you as well!
