Solved: Why is heavy forwarder repeatedly getting "WARN Tc...

hagjos43 · ‎09-09-2014

We are seeing the following errors on our Heavy Forwarder side:

09-05-2014 13:39:06.483 - 0400 INFO TcpOutputProc - Connected to idx= 23.42.214.219:9997
09-05-2014 13:39:06.484 - 0400 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 600 seconds.
09-05-2014 13:39:36.493 - 0400 INFO TcpOutputProc - Connected to idx= 23.42.214.219:9997
09-05-2014 13:40:06.501 - 0400 INFO TcpOutputProc - Connected to idx= 23.42.214.219:9997
09-05-2014 13:40:36.509 - 0400 INFO TcpOutputProc - Connected to idx= 23.42.214.219:9997
09-05-2014 13:40:39.510 - 0400 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 700 seconds.
09-05-2014 13:41:06.517 - 0400 INFO TcpOutputProc - Connected to idx= 23.42.214.219:9997
09-05-2014 13:41:36.524 - 0400 INFO TcpOutputProc - Connected to idx= 23.42.214.219:9997
09-05-2014 13:42:06.533 - 0400 INFO TcpOutputProc - Connected to idx= 23.42.214.219:9997
09-05-2014 13:42:19.536 - 0400 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 800 seconds.

This continues to repeat through the current date. Anyone else experience this or have any suggestions?

masonmorales · ‎01-29-2016

From my experience, this is usually due to blocked queues at the indexers. The most common cause is insufficient IOPS/throughput at the indexers' disk subsystem. When a queue is full for a certain length of time on the indexer, the indexer will start rejecting forwarder connections so that it can clear its full queue(s) before processing new events.

Here are some searches you can run against the _internal index of your indexers to find and see the bottleneck:

View the current queue size:

index=_internal source=*metrics.log group=queue | timechart median(current_size) by name

Find blocked queue events:

index=_internal source=*metrics.log group=queue blocked

Blocked queues by queue and Splunk server (set the time range to the last 24 hours):

index=_internal source=*metrics.log sourcetype=splunkd group=queue | eval max=if(isnotnull(max_size_kb),max_size_kb,max_size) | eval curr=if(isnotnull(current_size_kb),current_size_kb,current_size) | eval fill_perc=round((curr/max)*100,2) | eval name=host.":".name | where fill_perc>=99.0 | timechart max(fill_perc) as MaxFillPerc by name useother=false limit=100 minspan=1h

Count how many times queues were >=99% full, by queue name and Splunk server:

index=_internal source=*metrics.log sourcetype=splunkd group=queue | eval max=if(isnotnull(max_size_kb),max_size_kb,max_size) | eval curr=if(isnotnull(current_size_kb),current_size_kb,current_size)  | eval fill_perc=round((curr/max)*100,2) | where fill_perc>=99.0 | stats count by name host  | eval name=case(name=="aggqueue","2 - Aggregation Queue",name=="indexqueue","4 - Indexing Queue",name=="parsingqueue","1 - Parsing Queue",name=="typingqueue","3 - Typing Queue", 1=1, name)
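
To check the arithmetic outside of Splunk, the fill-percentage logic in the last two searches can be sketched in Python. This is only a rough sketch, not part of the original answer: the sample lines are made up for illustration (modeled on the key=value layout of metrics.log queue entries), `examplequeue` is a hypothetical queue name, and `parse_fields`, `fill_percent`, and `nearly_full` are hypothetical helper names.

```python
import re

# Illustrative lines only; they are NOT taken from the poster's environment.
SAMPLE_LINES = [
    "01-29-2016 10:00:00.000 -0500 INFO Metrics - group=queue, name=indexqueue, "
    "blocked=true, max_size_kb=500, current_size_kb=499, current_size=1321",
    "01-29-2016 10:00:00.000 -0500 INFO Metrics - group=queue, name=parsingqueue, "
    "max_size_kb=6144, current_size_kb=1024, current_size=210",
    # Hypothetical queue that reports only count-based sizes (no *_kb fields).
    "01-29-2016 10:00:00.000 -0500 INFO Metrics - group=queue, name=examplequeue, "
    "max_size=10000, current_size=10000",
]

# Matches key=value pairs such as "name=indexqueue" or "max_size_kb=500".
FIELD_RE = re.compile(r"(\w+)=([^,\s]+)")


def parse_fields(line):
    """Return the key=value pairs of one log line as a dict of strings."""
    return dict(FIELD_RE.findall(line))


def fill_percent(fields):
    """Mirror the eval steps: prefer the *_kb fields, else the count fields."""
    max_size = fields.get("max_size_kb", fields.get("max_size"))
    curr = fields.get("current_size_kb", fields.get("current_size"))
    if max_size is None or curr is None or float(max_size) == 0:
        return None
    return round(float(curr) / float(max_size) * 100, 2)


def nearly_full(lines, threshold=99.0):
    """Names of queues whose fill percentage is at or above the threshold."""
    names = []
    for line in lines:
        fields = parse_fields(line)
        pct = fill_percent(fields)
        if fields.get("group") == "queue" and pct is not None and pct >= threshold:
            names.append(fields["name"])
    return names


if __name__ == "__main__":
    for line in SAMPLE_LINES:
        fields = parse_fields(line)
        print(fields["name"], fill_percent(fields))
    print("at or above 99%:", nearly_full(SAMPLE_LINES))
```

The 99.0 threshold is the same cut-off used in the `where fill_perc>=99.0` clause above; lowering it shows queues that are filling up but not yet saturated.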


PGrantham · ‎01-29-2016

Try checking your metrics.log on both your HF and indexer.

Do you see any blocked queues (like the parsingqueue or aggqueue or tcpinqueue)?
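
If searching _internal is not convenient, the same check can be run directly against the log file. The following is a minimal Python sketch under stated assumptions: it assumes blocked queue entries contain both `group=queue` and `blocked=true` on one line, the sample lines are invented for illustration, and `count_blocked` is a hypothetical helper name.

```python
import re
from collections import Counter

# Captures the queue name from a "name=<queue>" pair.
NAME_RE = re.compile(r"\bname=(\w+)")


def count_blocked(lines):
    """Count lines that report a blocked queue, grouped by queue name."""
    counts = Counter()
    for line in lines:
        if "group=queue" in line and "blocked=true" in line:
            match = NAME_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts


# Invented sample lines, NOT taken from a real system.
SAMPLE = [
    "01-29-2016 10:00:00.000 -0500 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499",
    "01-29-2016 10:00:31.000 -0500 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499",
    "01-29-2016 10:00:31.000 -0500 INFO Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=6144, current_size_kb=6143",
    "01-29-2016 10:00:31.000 -0500 INFO Metrics - group=queue, name=typingqueue, max_size_kb=500, current_size_kb=0",
]

if __name__ == "__main__":
    for name, hits in count_blocked(SAMPLE).most_common():
        print(name, hits)
```

To use it on a real instance, pass an open file handle instead of `SAMPLE`; metrics.log typically lives under $SPLUNK_HOME/var/log/splunk/. Running it on both the heavy forwarder and the indexer shows which side blocks first.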

bohanlon_splunk · ‎01-29-2016

Do you have a forwarder loop?
https://answers.splunk.com/answers/217915/splunk-app-for-windows-infrastructure-forwarding-t.html

khourihan_splun · ‎09-28-2015

See this post for steps to troubleshoot: http://answers.splunk.com/answers/189238/how-to-troubleshoot-error-on-splunk-6-universal-fo.html

In general, though, I'd use the Splunk on Splunk (SoS) app to diagnose where the bottleneck is. If you are running 6.3, you can use the DMC (Distributed Management Console) to do the same analysis: go to Settings and click the Distributed Management Console icon on the left.

cdupuis123 · ‎03-05-2015

Hi inters

Yes, I've spent time on the answers site with similar results. But after running Splunk for 3 years now, I've found that if I can't get the answer from Splunk Answers, I've either used the wrong search term or, most times, I find something close and am able to backwards/sideways engineer it until it fixes my issue. Of course, if all else fails, I call my SE or Support. Good luck with your POC.

inters · ‎03-03-2015

I am currently evaluating Splunk. Ceaselessly, I encounter errors like this and "answers.splunk.com" has no answers, only other frustrated questioners.

Why does anyone use this software???

satishsdange · ‎03-03-2015

Please post your questions. I am sure you will get answers.

djfisher · ‎02-20-2015

Same here, this started happening. Is it due to bad bandwidth or too many seconds between collections? I use the 9*Nix app to collect audit logs using rlog.sh

cdupuis123 · ‎01-12-2015

I don't have the answer, but I've got the same issue!!!! Anyone????

Why is heavy forwarder repeatedly getting "WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 600 seconds."
