Forwarding to indexer group [myGroupName] blocked for 10 seconds

R_B
Path Finder

I am getting a major problem on my systems that I have been troubleshooting for two days and cannot figure out. The problem I am receiving is "Forwarding to indexer group [myGroupName] blocked for 10 seconds". This is preventing data from being forwarded to the indexers and indexed.

In my environment, I have 2 clustered Search Heads and a Deployer along with 2 clustered Indexers and a Cluster Master. I have one Heavy Forwarder. One Deployment Server. One License Master. One syslog server that is forwarding syslog data to the Indexers. I see the error whenever I log into the web interface for any of the Splunk servers.

I cannot figure out how to fix this. Any help would be greatly appreciated.

jhupka
Path Finder

Do you have the Splunk Monitoring Console (splunk_monitoring_console app) configured on one of your instances (not your SH Cluster)? If you do not, start by configuring that:

http://docs.splunk.com/Documentation/Splunk/latest/DMC/

Once it is, go to the Indexing menu and start looking through those pages, especially the individual instance page, to see how things are performing. Are particular queues filling up? Running into memory issues? CPU issues?

Also, if you search for any errors that your Forwarders are reporting, what things precede the blocked message you outline above?

Essentially, what you find from these questions will help narrow down the root cause. For example, if the Merging Pipeline on your Indexer(s) is always at 100%, then that will bubble back to the Forwarders since the IDX will say, "Hey, I'm backed up...stop sending data..." Merging Pipeline issues in turn have very specific root causes, such as relying on the default event breaking and merging settings for sourcetypes defined in your props.conf instead of setting stuff like LINE_BREAKER and SHOULD_LINEMERGE manually.
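
To make the LINE_BREAKER point a bit more concrete, here is a rough Python emulation of how such a regex splits raw text into events: the text matched by the first capture group is thrown away and marks the boundary. This is only a sketch of the idea, not Splunk's implementation; the function name, the sample text, and the timestamp-based pattern are made up for illustration.

```python
import re

# Illustrative patterns only; a real sourcetype needs its own regex.
NEWLINE_BREAKER = r"([\r\n]+)"                       # break on every newline
TIMESTAMP_BREAKER = r"([\r\n]+)\d{2}-\d{2}-\d{4}\s"  # break only before a date

def break_events(raw, line_breaker=NEWLINE_BREAKER):
    """Split raw text into events at each match of `line_breaker`.

    Text matched by the first capture group is discarded; everything
    before it ends the previous event, everything after it starts the next.
    """
    events, start = [], 0
    for m in re.finditer(line_breaker, raw):
        events.append(raw[start:m.start(1)])
        start = m.end(1)
    if start < len(raw):
        events.append(raw[start:])
    return events

raw = (
    "05-30-2017 14:59:27 first event\n"
    "  continuation line\n"
    "05-30-2017 14:59:30 second event"
)
print(len(break_events(raw)))                     # one event per line
print(len(break_events(raw, TIMESTAMP_BREAKER)))  # continuation stays attached
```

With an explicit breaker like the second pattern, the events come out whole at this stage, which is the reason for pairing it with SHOULD_LINEMERGE turned off so no separate merging pass is needed.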

sloshburch
Ultra Champion

Heads up about "2 clustered Search Heads and a Deployer" - SHCs work best with odd numbers and a minimum of three search heads. While not required, you can learn more about the pros/cons here: http://docs.splunk.com/Documentation/Splunk/latest/DistSearch/SHCarchitecture#Captain_election_proce...

R_B
Path Finder

Hi SloshBurch,

Yes, I've been told that it is best to have at least 3 search heads, and it is something I am considering. Thank you for the advice!

R_B
Path Finder

Hi jhupka, thank you very much for the response.

I do have the Splunk Monitoring Console configured on the license master. Looking through the Indexing menu, I saw that in the data pipeline for my two indexers the fill ratios are 100% for all 4 Queues (Parsing Queue, Aggregator Queue, Typing Queue, and Index Queue). None of the indexes are 100% full, except one of the indexes is over 99% full. I don't appear to be having any CPU or memory issues.

Here is a section of splunkd.log (I replaced sensitive information with a description of what it is):
05-30-2017 14:59:27.656 -0400 WARN TcpOutputProc - Cooked connection to ip=Indexer#1IP:Port# timed out
05-30-2017 14:59:30.058 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 10 seconds.
05-30-2017 14:59:40.069 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 20 seconds.
05-30-2017 14:59:50.081 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 30 seconds.
05-30-2017 15:00:00.093 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 40 seconds.
05-30-2017 15:00:05.140 -0400 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_ClusterMasterIP_Port#_ClusterMasterIP_ClusterMasterHostName_80755AFE-FDB3-4B2A-8CEE-8B2DB16B69D4
05-30-2017 15:00:14.009 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 10 seconds.
05-30-2017 15:00:24.021 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 20 seconds.
05-30-2017 15:00:34.032 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 30 seconds.
05-30-2017 15:00:38.637 -0400 WARN TcpOutputProc - Read operation timed out expecting ACK from Indexer#2:Port# in 300 seconds.
05-30-2017 15:00:38.637 -0400 WARN TcpOutputProc - Possible duplication of events with channel=source::/opt/splunk/var/log/splunk/splunkd.log|host::ClusterMasterHostName|splunkd|199, streamId=2032662337362434837, offset=21761155 subOffset=199 on host=Indexer#2:Port#
05-30-2017 15:00:38.637 -0400 WARN TcpOutputProc - Possible duplication of events with channel=source::audittrail|host::ClusterMasterHostName|audittrail|, streamId=0, offset=0 on host=Indexer#2IP:Port#
05-30-2017 15:00:44.043 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 40 seconds.
05-30-2017 15:00:54.054 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 50 seconds.
05-30-2017 15:00:58.559 -0400 WARN TcpOutputProc - Cooked connection to ip=Indexer#2:Port# timed out
05-30-2017 15:01:04.067 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 60 seconds.
05-30-2017 15:01:05.147 -0400 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_ClusterMasterIP_Port#_ClusterMasterIP_ClusterMasterHostName_80755AFE-FDB3-4B2A-8CEE-8B2DB16B69D4
05-30-2017 15:01:14.078 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 70 seconds.

So, I'm assuming that the problem here is the 4 data pipeline queues are filling to max capacity, right? How would that be happening though, and how could I fix that and prevent it from happening again?
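
For anyone wanting to summarize warnings like the ones in the excerpt above, here is a minimal Python sketch that groups the "blocked for N seconds" lines into episodes and records how long each one ran. The regex is written against the line format shown in this post only, and the function name and the rule "a new episode starts when the counter stops increasing" are my own assumptions.

```python
import re
from datetime import datetime

# Matches the TcpOutputProc warning as it appears in the excerpt above.
BLOCKED = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) [+-]\d{4} WARN\s+"
    r"TcpOutputProc - Forwarding to indexer group (?P<group>\S+) "
    r"blocked for (?P<secs>\d+) seconds"
)

def blocked_episodes(lines):
    """Return a list of (group, first_warning_time, longest_seconds) tuples.

    Consecutive warnings for the same group belong to one episode as long as
    the reported seconds keep increasing; a lower value starts a new episode.
    """
    episodes = []
    for line in lines:
        m = BLOCKED.match(line)
        if not m:
            continue  # skip INFO lines and other warnings
        group, secs = m["group"], int(m["secs"])
        ts = datetime.strptime(m["ts"], "%m-%d-%Y %H:%M:%S.%f")
        if episodes and episodes[-1][0] == group and secs > episodes[-1][2]:
            episodes[-1] = (group, episodes[-1][1], secs)
        else:
            episodes.append((group, ts, secs))
    return episodes

# A few lines copied from the excerpt above.
sample = [
    "05-30-2017 14:59:30.058 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 10 seconds.",
    "05-30-2017 14:59:40.069 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 20 seconds.",
    "05-30-2017 15:00:00.093 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 40 seconds.",
    "05-30-2017 15:00:14.009 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 10 seconds.",
    "05-30-2017 15:00:24.021 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 20 seconds.",
]
for group, first_seen, longest in blocked_episodes(sample):
    print(group, first_seen, longest)
```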

jhupka
Path Finder

When you look at the queue order of Parsing Queue - Aggregator Queue - Typing Queue - Index Queue, the one farthest to the right that is full most likely has the root cause, and then things back up to the left (and once the Parsing Queue is full, that's when it affects the Forwarders).
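
That rule of thumb can be written down as a tiny Python sketch. The function name and the 100% threshold are assumptions for illustration; the fill ratios would be whatever the Monitoring Console shows for an indexer.

```python
# Pipeline order as described above: data flows from left to right.
PIPELINE = ["Parsing Queue", "Aggregator Queue", "Typing Queue", "Index Queue"]

def likely_bottleneck(fill_ratios, full_at=100.0):
    """Return the right-most queue whose fill ratio is at or above `full_at`.

    Returns None when no queue is full. Queues missing from `fill_ratios`
    are treated as empty.
    """
    for name in reversed(PIPELINE):
        if fill_ratios.get(name, 0.0) >= full_at:
            return name
    return None

# The situation reported in this thread: all four queues at 100%.
reported = {name: 100.0 for name in PIPELINE}
print(likely_bottleneck(reported))  # Index Queue
```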

Since the Index Queue is full the culprit often comes down to read/write issues on the Indexer - writing to disk is what happens after the Index Queue. These are the questions I would ask of your environment next:

  • Are you running out of disk space on your Indexers?
  • Are events coming in from the Forwarders too fast, or in too great a volume, for the IOPS available on your two Indexers?
  • If you take a look in the DMC for License Usage, what are your trends and how does that compare to when the Indexers have a problem?

As a side note, I would definitely look into adding a third Indexer no matter what. You have all of these other supporting Splunk instances around your Index Cluster, but if one Indexer goes down you immediately lose 50% of your capacity. Adding a third will give you that much more headroom. Also, Splunk scales wonderfully horizontally at the Indexer layer...it is cheap/easy to improve all aspects of your environment by adding more there (indexing speed, search speed, disk space, etc).

R_B
Path Finder

SloshBurch, thank you for pointing this out. I have not opened a case before as I'm still rather new with Splunk, and did not think about doing so.

jhupka, that is a good point about losing 50% capacity with only 2 indexers; that is something I will be considering. I did some more troubleshooting and think I fixed the problem, as I have not gotten any errors or other problems since implementing the fix. It makes sense that when the index queue is full, the other queues get backed up. I just don't understand how the problem was causing the index queue and the other queues to be completely full and backed up. Perhaps you have some insight or a better understanding of the situation?

So what I discovered was that when both of my search heads were active, all queues were full and all data was blocked from being forwarded to the indexers. However, if I stopped the Splunk service on just one of the search heads (it did not matter which one), the queues would empty and I had no problems. When both were running, in addition to all the data being blocked, I had an error on the search heads stating that the configuration bundle was timing out when it attempted to replicate to the search peers (the default timeout is 60 seconds, as set with sendRcvTimeout in distsearch.conf). So, I changed sendRcvTimeout from the default 60 seconds to 300 seconds (5 minutes) on the search heads and restarted the Splunk service on them, and everything has been working without errors since. I'm able to search on both search heads and ingest data without any errors.

So the two things I don't understand are: what configuration bundle is being replicated from the search heads to the search peers, and how does that cause all the queues, starting with the index queue, to fill up and not empty?
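
As a quick way to keep track of an override like this, here is a hedged Python sketch that reads a distsearch.conf-style text and reports whether sendRcvTimeout differs from the 60-second default mentioned above. The stanza name [replicationSettings] is my assumption about where the setting lives, the function name is made up, and Python's configparser only handles simple .conf files, so treat this as an illustration rather than a real audit tool.

```python
import configparser

DEFAULT_SEND_RCV_TIMEOUT = 60  # default value as stated in the post above

def send_rcv_timeout(conf_text):
    """Return (value, is_default) for sendRcvTimeout in the given conf text.

    Falls back to the default when the stanza or the setting is absent.
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep setting names case-sensitive
    parser.read_string(conf_text)
    value = parser.getint(
        "replicationSettings",  # assumed stanza name
        "sendRcvTimeout",
        fallback=DEFAULT_SEND_RCV_TIMEOUT,
    )
    return value, value == DEFAULT_SEND_RCV_TIMEOUT

# The change described above, written as it might appear in a local file.
local_conf = """
[replicationSettings]
sendRcvTimeout = 300
"""
print(send_rcv_timeout(local_conf))  # (300, False)
```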

sloshburch
Ultra Champion

Very interesting. This is def in the realm of a support case (remember, you've already paid for support, so might as well use it).

Whenever I hear of limits like sendRcvTimeout being changed, I get worried. In my experience, needing to change a setting like that is indicative of a larger problem, and by weakening a limit you essentially just ignore the symptoms of the problem and can be misled into thinking the problem is gone.

From what you've described, it sounds like the two symptoms we are seeing are that (1) the indexers are getting backed up writing to disk AND (2) the search heads are taking a while to complete sending bundles to the indexer. I'm going to infer that the indexer is also taking too long to write the bundle to disk, hence the search head taking a while to finish sending the bundle.

The next things I'd look at are:

  1. What's the IOPS on the indexers' disks? If the indexer can't read/write fast enough from that storage, then you've likely got the problem right there.
  2. Is the search head trying to push a massive bundle to the indexers such that it's holding up the indexers?
  3. Has other configuration in the env been set up such that the indexers are being tasked with more work than they should be doing? Support could help coach through that, OR look at what settings you have in place that are NOT part of the system/default. (Hint: btool)

These are not mutually exclusive. In other words, you could be experiencing BOTH problems 1 and 2, not just 1 or 2.

R_B
Path Finder

That's a good point, it wouldn't hurt to use the support! It makes sense that sendRcvTimeout might not be the real root of the problem. Those two symptoms sound pretty accurate, so I'm going to dig some more into this. Thank you very much for all the insight and help!

sloshburch
Ultra Champion

BTW: This also sounds like a great candidate for a support case so you can do webex and learn in real time from the talented Splunk support team.