I am running into a major problem on my systems that I have been troubleshooting for two days and cannot figure out. The error I am receiving is "Forwarding to indexer group [myGroupName] blocked for 10 seconds". This is preventing data from being forwarded to the indexers and indexed.
In my environment, I have 2 clustered Search Heads and a Deployer, along with 2 clustered Indexers and a Cluster Master. I also have one Heavy Forwarder, one Deployment Server, one License Master, and one syslog server that forwards syslog data to the Indexers. I see the error whenever I log into the web interface on any of the Splunk servers.
I cannot figure out how to fix this, any help would be greatly appreciated.
Do you have the Splunk Monitoring Console (splunk_monitoring_console app) configured on one of your instances (not your SH Cluster)? If you do not, start by configuring that:
http://docs.splunk.com/Documentation/Splunk/latest/DMC/
Once it is configured, go to the Indexing menu and start looking through those pages, especially the individual instance page, to see how things are performing. Are particular queues filling up? Are you running into memory issues? CPU issues?
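If you prefer checking from the search bar rather than the console pages, every Splunk instance logs queue metrics to the _internal index by default. A search along these lines shows queue fill percentages over time; the exact field names can vary slightly by version, so treat this as a sketch (replace YourIndexerHost with a real hostname):

```
index=_internal source=*metrics.log* group=queue host=YourIndexerHost
| eval fill_pct = round(current_size_kb / max_size_kb * 100, 1)
| timechart avg(fill_pct) by name
```

A queue that is pinned near 100% in this chart is the same signal the Monitoring Console's Indexing pages surface.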
Also, if you search for any errors that your Forwarders are reporting, what messages precede the blocked message you describe above?
Essentially, what you find from these questions will help narrow down the root cause. For example, if the Merging Pipeline on your Indexer(s) is always at 100%, that will bubble back to the Forwarders, since the IDX will say, "Hey, I'm backed up...stop sending data..." Merging Pipeline issues in turn have very specific root causes, such as relying on the default event breaking and merging settings for sourcetypes defined in your props.conf instead of setting options like LINE_BREAKER and SHOULD_LINEMERGE manually.
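To illustrate that last point, here is a hypothetical props.conf stanza (the sourcetype name and regex are made up for a timestamp-prefixed log) that disables line merging and breaks events explicitly, which lets the pipeline skip the expensive merging work:

```
# props.conf -- hypothetical sourcetype; adjust the regex to your log format
[my:custom:log]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)\d{4}-\d{2}-\d{2}
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 19
```

With SHOULD_LINEMERGE = false, event boundaries come entirely from LINE_BREAKER's capture group, which is much cheaper than the default merge-then-re-split behavior.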
Heads up about "2 clustered Search Heads and a Deployer" - SHCs work best with odd numbers and a minimum of three search heads. While not required, you can learn more about the pros and cons here: http://docs.splunk.com/Documentation/Splunk/latest/DistSearch/SHCarchitecture#Captain_election_proce...
Hi SloshBurch,
Yes, I've been told that it is best to have at least 3 search heads, and it is something I am considering. Thank you for the advice!
Hi jhupka, thank you very much for the response.
I do have the Splunk Monitoring Console configured on the License Master. Looking through the Indexing menu, I saw that in the data pipeline for my two Indexers the fill ratios are 100% for all 4 queues (Parsing Queue, Aggregator Queue, Typing Queue, and Index Queue). None of the indexes are 100% full, though one is over 99% full. I don't appear to be having any CPU or memory issues.
Here is a section of splunkd.log (I replaced sensitive information with a description of what it is):
05-30-2017 14:59:27.656 -0400 WARN TcpOutputProc - Cooked connection to ip=Indexer#1IP:Port# timed out
05-30-2017 14:59:30.058 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 10 seconds.
05-30-2017 14:59:40.069 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 20 seconds.
05-30-2017 14:59:50.081 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 30 seconds.
05-30-2017 15:00:00.093 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 40 seconds.
05-30-2017 15:00:05.140 -0400 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_ClusterMasterIP_Port#_ClusterMasterIP_ClusterMasterHostName_80755AFE-FDB3-4B2A-8CEE-8B2DB16B69D4
05-30-2017 15:00:14.009 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 10 seconds.
05-30-2017 15:00:24.021 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 20 seconds.
05-30-2017 15:00:34.032 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 30 seconds.
05-30-2017 15:00:38.637 -0400 WARN TcpOutputProc - Read operation timed out expecting ACK from Indexer#2:Port# in 300 seconds.
05-30-2017 15:00:38.637 -0400 WARN TcpOutputProc - Possible duplication of events with channel=source::/opt/splunk/var/log/splunk/splunkd.log|host::ClusterMasterHostName|splunkd|199, streamId=2032662337362434837, offset=21761155 subOffset=199 on host=Indexer#2:Port#
05-30-2017 15:00:38.637 -0400 WARN TcpOutputProc - Possible duplication of events with channel=source::audittrail|host::ClusterMasterHostName|audittrail|, streamId=0, offset=0 on host=Indexer#2IP:Port#
05-30-2017 15:00:44.043 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 40 seconds.
05-30-2017 15:00:54.054 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 50 seconds.
05-30-2017 15:00:58.559 -0400 WARN TcpOutputProc - Cooked connection to ip=Indexer#2:Port# timed out
05-30-2017 15:01:04.067 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 60 seconds.
05-30-2017 15:01:05.147 -0400 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_ClusterMasterIP_Port#_ClusterMasterIP_ClusterMasterHostName_80755AFE-FDB3-4B2A-8CEE-8B2DB16B69D4
05-30-2017 15:01:14.078 -0400 WARN TcpOutputProc - Forwarding to indexer group myGroupName blocked for 70 seconds.
So I'm assuming the problem here is that the 4 data pipeline queues are filling to max capacity, right? How is that happening, though, and how can I fix it and prevent it from happening again?
When you look at the queue order of Parsing Queue - Aggregator Queue - Typing Queue - Index Queue, the full queue farthest to the right most likely has the root cause, and then things back up to the left (and once the Parsing Queue is full, that's when it affects the Forwarders).
Since the Index Queue is full, the culprit often comes down to read/write issues on the Indexer - writing to disk is what happens after the Index Queue, so disk I/O performance on the Indexers is what I would investigate next.
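One way to eyeball disk I/O without leaving Splunk: if platform instrumentation is enabled (it is by default on recent versions), the _introspection index carries per-host I/O stats. Field names can differ by version, so take this as a sketch rather than a guaranteed search (YourIndexerHost is a placeholder):

```
index=_introspection host=YourIndexerHost component=IOStats
| timechart avg(data.avg_service_ms) avg(data.reads_ps) avg(data.writes_ps)
```

Sustained high service times during the periods when the Index Queue sits at 100% would point squarely at the storage layer.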
As a side note, I would definitely look into adding a third Indexer no matter what. You have all of these other supporting Splunk instances around your Index Cluster, but if one Indexer goes down you immediately lose 50% of your capacity. Adding a third will give you that much more headroom. Also, Splunk scales wonderfully horizontally at the Indexer layer...it is cheap and easy to improve all aspects of your environment by adding more there (indexing speed, search speed, disk space, etc.).
SloshBurch, thank you for pointing this out. I have not opened a case before, as I'm still rather new to Splunk, and did not think about doing so.
jhupka, that is a good point about losing 50% of capacity with only 2 Indexers; that is something I will consider. I did some more troubleshooting and think I fixed the problem, as I have not gotten any errors or other problems since implementing the fix. It makes sense that when the Index Queue is full, the other queues get backed up, but I don't understand how the problem was causing the Index Queue and the other queues to be completely full and backed up. Perhaps you have some insight or a better understanding of the situation?
What I discovered was that when both of my search heads were active, all queues were full and all data was blocked from being forwarded to the Indexers. However, if I stopped the splunk service on just one of the search heads (essentially running with a single search head; it did not matter which one), the queues would empty and I had no problems. When both were running, in addition to all the data being blocked, I had an error on the search heads stating that the configuration bundle timed out when it attempted to replicate to the search peers (the default time of 60 seconds, as set with sendRcvTimeout in distsearch.conf). So I changed sendRcvTimeout from the default 60 seconds to 300 seconds (5 minutes) on the search heads, restarted the splunk service on them, and everything has been working without errors since. I am able to search on both search heads and ingest data without any errors. Two things I don't understand: what configuration bundle is being replicated from the search heads to the search peers, and how is that causing all the queues, starting with the Index Queue, to fill up and not empty?
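For reference, the change described above lives in the [replicationSettings] stanza of distsearch.conf on each search head, and looks roughly like this (a restart is required, as noted):

```
# distsearch.conf on each search head
[replicationSettings]
# Default is 60 seconds; raised to 5 minutes to stop
# "bundle replication timed out" errors to the search peers
sendRcvTimeout = 300
```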
Very interesting. This is definitely in the realm of a support case (remember, you've already paid for support, so you might as well use it).
Whenever I hear of limits like sendRcvTimeout being changed, I get worried. In my experience, needing to change a setting like that is indicative of a larger problem; by weakening a limit, you essentially just ignore the symptoms and can be misled into thinking the problem is gone.
From what you've described, it sounds like the two symptoms we are seeing are that (1) the indexers are getting backed up writing to disk AND (2) the search heads are taking a while to complete sending bundles to the indexer. I'm going to infer that the indexer is also taking too long to write the bundle to disk, hence the search head taking a while to finish sending the bundle.
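One common reason bundle pushes run long is simply a bloated bundle, for example large lookup CSVs being replicated to every peer on each push. If that turns out to be the cause here, distsearch.conf lets you exclude files from replication. The stanza below is a hypothetical example (the name, app, and pattern are made up) excluding large CSV lookups in one app:

```
# distsearch.conf on the search heads -- hypothetical exclusion
[replicationBlacklist]
big_lookups = apps[/\\]myapp[/\\]lookups[/\\]*.csv
```

Shrinking the bundle attacks the root cause directly, instead of widening the timeout to tolerate the slow push.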
Those two possibilities are the next things I'd dig into. Note that they are not mutually exclusive; in other words, you could be experiencing both problem 1 and problem 2, not just one or the other.
That's a good point - it wouldn't hurt to use the support! That makes sense that sendRcvTimeout might not be the real root of the problem. Those two symptoms sound pretty accurate; I'm going to dig some more into this. Thank you very much for all the insight and help!
BTW: This also sounds like a great candidate for a support case so you can do webex and learn in real time from the talented Splunk support team.