Just setting up my first distributed Splunk deployment. I have an SH cluster with 3 members (search factor = 3) and a deployer, an indexer cluster with 5 peers (replication factor = 5) and a master node, and 3 forwarders using a deployment server.
I am trying to implement the "best practice" of forwarding the SHC members' internal data to my indexer cluster. I configured the following outputs.conf on the SHC deployer and pushed it to the 3 SHC members using the splunk apply shcluster-bundle command (this completed successfully). However, I am getting a blocking problem, reported as follows in Messages on the SHC members' Splunk Web portal:
Forwarding to indexer group transtrophe_search_peers blocked for 68300 seconds
(looks like a long time of blocking - lol).
On the SHC members, the parsing queue is set in the $SPLUNK_HOME/var/run/splunk/merged/server.conf as follows:
[queue=parsingQueue]
cntr_1_lookback_time = 60s
cntr_2_lookback_time = 600s
cntr_3_lookback_time = 900s
maxSize = 6MB
sampling_interval = 1s
Any suggestions on how to resolve this issue?
Problem resolved: I forgot to configure the receiving port in inputs.conf on my indexer cluster members. I fixed this by applying a cluster bundle from the master node. All the queue blocking dissipated in about 15 seconds.
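For reference, a minimal sketch of the kind of receiver stanza that was missing. It assumes the indexers should listen on 9997 (the port in the outputs.conf server list) and that the file lives under the master node's master-apps/_cluster/local directory; the exact location is an assumption, not something stated in this thread.

```ini
# inputs.conf for the indexer cluster peers
# Assumed location on the master node:
#   $SPLUNK_HOME/etc/master-apps/_cluster/local/inputs.conf

# Listen for cooked (Splunk-to-Splunk) data from forwarders and SHC members.
# 9997 matches the port used in the [tcpout:transtrophe_search_peers] server list.
[splunktcp://9997]
disabled = 0
```

After placing the file, running splunk apply cluster-bundle on the master node distributes it to the peers.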
I have run into a new issue, however, and will open a new question thread for it:
How to identify root cause and resolve "Search peer has the following message: Too many streaming errors to target=. Not rolling hot buckets on further errors to this target. (This condition might exist with other targets too. Please check the logs)?"
Well, I do have firewall rules allowing inbound connections to ports 8089, 8191 and 9997 for the subnet that the shcluster members, index cluster members and forwarders (as well as the shcluster deployer and the forwarders' deployment server) are on. Here is the result just now of doing a nc -vv from the shcluster captain to one of the index cluster members over port 9997:
root@ip-172-31-17-5:/home/admin# nc -vv ip-172-31-18-186 9997
ip-172-31-18-186.ec2.internal [172.31.18.186] 9997 (?) open
?_rawN?6?bm?wV
1427945390
__s2s_bid%37D7692E-5D49-432E-9A6F-89C0C68FACEF
__s2s_rtype
eChallengeMetaData:Hostip-172-31-18-186_raw
It looks like there is a challenge that is not resolving, which raises the following question for me: does the symmetric secret used by the shcluster members and the index cluster members need to be coordinated? I know this is the case between the deployer and the shcluster members, as well as between the master node and the index cluster peers. I didn't see anything in the documentation that called for this pass4SymmKey coordination when implementing forwarding from the shcluster members to the index layer of a distributed deployment.
Here is a sample of metrics.log records showing the connection tries and failures. This is from the shcluster captain, with the destinations being my index cluster peers. I am observing here that the dest port is 9997 (the replication port), as specified in the outputs.conf [tcpout:transtrophe_search_peers] stanza, while the source port of the captain (all the shcluster members, actually) is 8089 (the uri mgmt port).
04-02-2015 02:36:20.723 +0000 INFO StatusMgr - destHost=ip-172-31-22-253, destIp=172.31.22.253, destPort=9997, eventType=connect_try, publisher=tcpout, sourcePort=8089, statusee=TcpOutputProcessor
04-02-2015 02:36:20.724 +0000 INFO StatusMgr - destHost=ip-172-31-22-253, destIp=172.31.22.253, destPort=9997, eventType=connect_fail, publisher=tcpout, sourcePort=8089, statusee=TcpOutputProcessor
04-02-2015 02:36:20.725 +0000 INFO StatusMgr - destHost=ip-172-31-20-120, destIp=172.31.20.120, destPort=9997, eventType=connect_try, publisher=tcpout, sourcePort=8089, statusee=TcpOutputProcessor
04-02-2015 02:36:20.726 +0000 INFO StatusMgr - destHost=ip-172-31-20-120, destIp=172.31.20.120, destPort=9997, eventType=connect_fail, publisher=tcpout, sourcePort=8089, statusee=TcpOutputProcessor
On the other hand... actual failed connections from the metrics log suggest that you might have some super specific firewall rules happening. Aside from this forwarding... the indexers are not expecting data from the search heads on 9997. Be sure that's possible... and that the ports will accept the data from those IPs. I know it sounds far-fetched... but it couldn't hurt to be sure. I'm sure someone else will pipe up overnight...
It's hard to tell whether you are showing us the output of btool and therefore we'll see the inherited directives from default or whether you are actually repeating them. I mention this because the one item that is not like default is not setting what it should...
In the case of:
forwardedindex.0.whitelist = .
forwardedindex.1.blacklist = _.
without the asterisk, that pattern is just ".", which matches a single character rather than everything
it should be:
forwardedindex.0.whitelist = .*
forwardedindex.1.blacklist = _.*
Your configuration of the [queue=parsingQueue] stanza is also redundant, as you've now deliberately set those values to what they inherit anyway...
Other than that I'm not seeing any untoward settings. If that is the output of btool, you might want to cut it down to just the settings you are pushing. Because it could just be my eyes. 😉
Also, I was just looking at the metrics.log on one of the shcluster members for all the blocked entries and, based on those records, adjusted the size of the blocked queues to be about 15% larger than the noted largest size. After making these changes, I did a rolling-restart from the captain.
Checking metrics.log I am observing the following log record on the captain and 1 of the shcluster members:
04-02-2015 02:30:07.252 +0000 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=2048, current_size_kb=2047, current_size=766, largest_size=766, smallest_size=766
Not sure why this indexqueue would be blocking when the max_size_kb is significantly larger than the current_size.
Yes, that is the document reference I was talking about.
Here is the outputs.conf that I pushed from the deployer to the shcluster members:
[tcpout]
maxQueueSize = auto
forwardedindex.0.whitelist = .*
forwardedindex.1.blacklist = _.*
forwardedindex.2.whitelist = (_audit|_internal|_introspection)
forwardedindex.filter.disable = true
indexAndForward = false
autoLBFrequency = 30
blockOnCloning = true
compressed = false
disabled = false
dropClonedEventsOnQueueFull = 5
dropEventsOnQueueFull = -1
heartbeatFrequency = 30
maxFailuresPerInterval = 2
secsInFailureInterval = 1
maxConnectionsPerIndexer = 2
forceTimebasedAutoLB = false
sendCookedData = true
connectionTimeout = 20
readTimeout = 300
writeTimeout = 300
useACK = false
blockWarnThreshold = 100
sslQuietShutdown = false
defaultGroup = transtrophe_search_peers
[syslog]
type = udp
priority = <13>
dropEventsOnQueueFull = -1
maxEventSize = 1024
[indexAndForward]
index = false
[tcpout:transtrophe_search_peers]
server=ip-172-31-20-173:9997,ip-172-31-18-186:9997,ip-172-31-22-253:9997,ip-172-31-26-200:9997,ip-172-31-20-120:9997
autoLB = true
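A trimmed version of the listing above, keeping only the stanzas the search-head forwarding setup appears to need, might look like the sketch below. It assumes that every [tcpout] and [syslog] value left out here is simply an inherited default; that has not been checked against this deployment's btool output.

```ini
# outputs.conf pushed from the deployer to the SHC members (trimmed sketch)

# Do not index locally on the search heads.
[indexAndForward]
index = false

[tcpout]
defaultGroup = transtrophe_search_peers
# With the filter disabled, the forwardedindex whitelist/blacklist
# entries are not applied, so they are omitted here.
forwardedindex.filter.disable = true
indexAndForward = false

[tcpout:transtrophe_search_peers]
server=ip-172-31-20-173:9997,ip-172-31-18-186:9997,ip-172-31-22-253:9997,ip-172-31-26-200:9997,ip-172-31-20-120:9997
autoLB = true
```

Keeping the pushed file this small makes it easier to see which settings are deliberate and which are just defaults showing up in the merged view.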
You forgot to post the outputs.conf contents...
By "best practice" you're referring to these instructions: http://docs.splunk.com/Documentation/Splunk/6.2.2/DistSearch/Forwardsearchheaddata
Correct?