I am new to Splunk and trying to troubleshoot the dreaded newbie issue "Search peer 'xxx' has the following message: Too many streaming errors to target='target':9997. Not rolling hot buckets on further errors to this target." on a distributed deployment built as follows:
Search layer composed of a search head cluster with 3 search heads and the cluster's deployer;
Index layer composed of an indexer cluster with 5 indexers and the cluster's master node - the cluster is configured with replication factor = 5;
Forwarder layer currently deployed with 3 forwarders and a deployment server (not really doing anything with these forwarders yet, until I get the issues described in this post resolved).
I have the SoS app v3.2.1 deployed on the SHC members (applied successfully using apply shcluster-bundle from the SHC deployer) and configured to enable the two SoS input scripts (lsof_sos.sh and ps_sos.sh) on each of the 3 SHC members. BTW, after doing this I realized I could, and probably will, make future changes in apps/sos/local/inputs.conf on the SHC deployer for better, more time-efficient configuration management. I also deployed Sideview Utils v3.3.2 to the SHC members the same way, although truth be told I inadvertently deployed the SoS app BEFORE Sideview Utils; I didn't configure SoS until after correcting this deployment faux pas (or should we call that a foobar - lol).
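For concreteness, this is the kind of inputs.conf I mean on the deployer (a sketch only; the script names, index, and intervals mirror the btool output further down in this post, and on the deployer the app sits under etc/shcluster/apps before being pushed to the members):

```
# $SPLUNK_HOME/etc/shcluster/apps/sos/local/inputs.conf (illustrative)
[script://./bin/lsof_sos.sh]
disabled = 0
index = sos
interval = 600
sourcetype = lsof

[script://./bin/ps_sos.sh]
disabled = 0
index = sos
interval = 5
sourcetype = ps
```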
I also deployed TA-sos v2.0.5 on the 5 idx-cluster peers using the idx-cluster master node (apply cluster-bundle) and configured each idx-cluster peer to enable the two SoS input scripts (lsof_sos.sh and ps_sos.sh).
On one idx-cluster peer I have the following sets of configurations, dumped using btool:
From indexes.conf (only showing settings that are not from the default indexes.conf):
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf [_audit]
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf repFactor = auto
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf [_internal]
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf repFactor = auto
/opt/splunk/etc/system/default/indexes.conf [default]
/opt/splunk/etc/system/default/indexes.conf repFactor = 0
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf [main]
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf repFactor = auto
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf [sos]
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf coldPath = $SPLUNK_DB/sos/colddb
/opt/splunk/etc/system/default/indexes.conf defaultDatabase = main
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf disabled = 0
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf frozenTimePeriodInSecs = 2419200
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf homePath = $SPLUNK_DB/sos/db
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf repFactor = auto
From the idx cluster peer's inputs.conf:
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf [script:///opt/splunk/etc/slave-apps/TA-sos/bin/lsof_sos.sh]
/opt/splunk/etc/system/default/inputs.conf _rcvbuf = 1572864
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf disabled = 0
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf host = ip-172-31-26-237
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf index = sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf interval = 600
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf source = lsof_sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf sourcetype = lsof
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf [script:///opt/splunk/etc/slave-apps/TA-sos/bin/ps_sos.sh]
/opt/splunk/etc/system/default/inputs.conf _rcvbuf = 1572864
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf disabled = 0
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf host = ip-172-31-26-237
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf index = sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf interval = 5
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf source = ps_sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf sourcetype = ps
/opt/splunk/etc/system/default/inputs.conf [splunktcp]
/opt/splunk/etc/system/default/inputs.conf _rcvbuf = 1572864
/opt/splunk/etc/system/default/inputs.conf acceptFrom = *
/opt/splunk/etc/system/default/inputs.conf connection_host = ip
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf host = ip-172-31-26-237
/opt/splunk/etc/system/default/inputs.conf index = default
/opt/splunk/etc/system/default/inputs.conf route = has_key:_replicationBucketUUID:replicationQueue;has_key:_dstrx:typingQueue;has_key:_linebreaker:indexQueue;absent_key:_linebreaker:parsingQueue
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf [splunktcp://9997]
/opt/splunk/etc/system/default/inputs.conf _rcvbuf = 1572864
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf disabled = 0
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf host = ip-172-31-26-237
/opt/splunk/etc/system/default/inputs.conf index = default
This is the configuration for ALL the idx cluster peers.
On any given idx cluster peer, when I analyze $SPLUNK_HOME/var/log/splunk/splunkd.log I get the following results (the log-analysis command line is shown at the top of the trace):
root@ip-172-31-26-200:/opt/splunk/var/log/splunk# grep CMStreamingErrorJob splunkd.log* | cut -d' ' -f10,12 | sort | uniq -c | sort -nr
    178 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=27EE43F3-BD89-4BC4-9200-298F99B4275A
    170 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A
    169 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3
    163 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF
     12 srcGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
     12 srcGuid=27EE43F3-BD89-4BC4-9200-298F99B4275A failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
     11 srcGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
     10 srcGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3 failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
      2 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3
      2 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A
      1 tgtGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3 failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
      1 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF
      1 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=27EE43F3-BD89-4BC4-9200-298F99B4275A
      1 tgtGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
      1 tgtGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
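For anyone who wants to reuse that one-liner: it keeps fields 10 and 12 of each CMStreamingErrorJob line (the srcGuid/tgtGuid and the failingGuid) and counts the distinct pairs. A self-contained sketch against fabricated sample lines (the GUIDs and the surrounding key=value fields are made up; only the field positions match the real entries):

```shell
# Fabricated splunkd.log sample laid out so srcGuid is field 10 and
# failingGuid is field 12 when split on single spaces.
cat > /tmp/sample_splunkd.log <<'EOF'
04-05-2015 13:00:01.000 +0000 WARN CMStreamingErrorJob - job=stream err=failed n=1 srcGuid=AAA to failingGuid=BBB
04-05-2015 13:00:06.000 +0000 WARN CMStreamingErrorJob - job=stream err=failed n=2 srcGuid=AAA to failingGuid=BBB
04-05-2015 13:00:11.000 +0000 WARN CMStreamingErrorJob - job=stream err=failed n=3 srcGuid=AAA to failingGuid=CCC
EOF

# Count distinct (srcGuid, failingGuid) pairs, most frequent first;
# here the AAA->BBB pair comes out on top with a count of 2.
grep CMStreamingErrorJob /tmp/sample_splunkd.log \
  | cut -d' ' -f10,12 \
  | sort | uniq -c | sort -nr
```

The high counts all pointing the same direction (one srcGuid failing against every other peer, as in the real output above) is usually more telling than any single error line.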
If I run watch -n 1 'ls -alt' against the hot bucket directory for the sos index, I get the following snapshot in time:
Every 1.0s: ls -alt                                     Sun Apr 5 13:19:59 2015

total 132
drwx--x--x  3 splunk splunk  4096 Apr  5 13:19 .
drwx------ 47 splunk splunk  4096 Apr  5 13:19 ..
-rw-------  1 splunk splunk  1585 Apr  5 13:19 1428239997-1428239997-1367416106440166490.tsidx
-rw-------  1 splunk splunk  1616 Apr  5 13:19 1428239992-1428239992-1367415778757555421.tsidx
-rw-------  1 splunk splunk  1617 Apr  5 13:19 1428239987-1428239987-1367415450991673577.tsidx
-rw-------  1 splunk splunk  1969 Apr  5 13:19 1428239982-1428239981-1367415123329070425.tsidx
-rw-------  1 splunk splunk 80922 Apr  5 13:19 1428239977-1428237297-1367414798428158871.tsidx
-rw-------  1 splunk splunk   291 Apr  5 13:19 Hosts.data
-rw-------  1 splunk splunk     7 Apr  5 13:19 .rawSize
-rw-------  1 splunk splunk    97 Apr  5 13:19 Sources.data
-rw-------  1 splunk splunk    97 Apr  5 13:19 SourceTypes.data
drwx------  2 splunk splunk  4096 Apr  5 13:17 rawdata
-rw-------  1 splunk splunk    41 Apr  5 12:34 Strings.data
-rw-------  1 splunk splunk    67 Apr  5 12:34 bucket_info.csv
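Incidentally, those long tsidx filenames decode to something useful: as far as I know, the first two dash-separated numbers are the newest and oldest event times covered by the file, as Unix epoch seconds (the third is an internal id). Decoding the largest file above with GNU date:

```shell
# Split <newest-epoch>-<oldest-epoch>-<id>.tsidx and render the epochs as UTC.
name="1428239977-1428237297-1367414798428158871.tsidx"
newest=$(printf '%s' "$name" | cut -d- -f1)
oldest=$(printf '%s' "$name" | cut -d- -f2)
date -u -d "@$oldest" '+oldest event: %Y-%m-%d %H:%M:%S UTC'
date -u -d "@$newest" '+newest event: %Y-%m-%d %H:%M:%S UTC'
```

That gives 12:34:57 to 13:19:37 UTC on Apr 5 2015, which lines up with the 12:34 mtime on Strings.data and the 13:19 timestamps in the listing.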
I've done some minor poking around with iptraf, iotop, and netstat, and successfully established TCP connections from each of the SHC members to each of the idx cluster peers on port 9997 using netcat.
I really don't know how to move forward on this troubleshooting, but wanted to lay out everything I've done so all could hopefully zero in on whatever it is I am missing here. I haven't yet tried adding indexers to the cluster - I was thinking about adding 3 more to make an 8 / 5 deployment (8 peers over repFactor = 5), but I think I'm going to wait on that until I get a solid operating condition on the current deployment (I want to control the dependent variables in this little exercise/experiment).
Thanks in advance to any and all that take the time to read my short summary of the situation - lol!
On the search head cluster, when I run SOS from any member, I get the following info message when I execute the resource-utilization search for any monitored resource -
search index=sos sourcetype="ps" host="ip-172-31-23-108"
| multikv
| eval type=case(like(ARGS, "%search%"), "searches", like(ARGS, "%root.py%start%") OR like(COMMAND, "%splunkweb%") OR (like(COMMAND, "%python%") AND like(ARGS, "%appserver%")), "Splunk Web", like(ARGS, "%-p%start%") OR (like(COMMAND, "%splunkd%") AND like(ARGS, "service")), "splunkd server")
| eval RSZMB=RSZKB/1024
| eval VSZMB=VSZKB/1024
| bin _time span=5s
| stats first(pctCPU) AS pctCPU, first(RSZMB) AS RSZMB, first(VSZMB) AS VSZMB, first(type) AS type by PID _time
| stats sum(pctCPU) AS pctCPU, sum(RSZMB) AS RSZMB, sum(VSZMB) AS VSZMB by type, _time
| bin _time span=4h
| sistats avg(pctCPU), median(pctCPU), median(RSZMB), median(VSZMB) by type, _time
over the time range: (earliest indexed event) – 1/1/01 12:00:00.000 AM
did not return any data. Possible solutions are to: relax the primary search criteria
widen the time range of the search
check that the default search indexes for your account include the desired indexes
The following messages were returned by the search subsystem: WARN: Search filters specified using splunkserver/splunkservergroup do not match any search peer.
Search filters specified using splunkserver/splunkserver_group do not match any search peer.
Also, I am running ALL hosts in AWS EC2 using the debian-wheezy-amd64-pvm-2015-01-28-ebs AMI as t1.micro instances, with 15GB of storage on the SHC members and the 4 forwarders (my earlier post said 3 forwarders - I actually have 4; not that this is likely part of the issue) and 20GB on the index cluster members. I don't know whether the t1.micro instance type is involved as an impact, so I wanted to bring it in as a possible factor.
The micro instance can DEFINITELY be a factor. AWS micro instances have very limited network bandwidth; AWS calls it "very low." So it is entirely possible that the bucket replication is failing because of poor/sporadic network performance. This can also affect forwarding, etc.
Spin up a small set of larger instances and see if the problem persists. In AWS, I use CentOS on either m1.medium or c3.large; these instances have "moderate" network performance. For production, you might want to choose a size that has even better network performance.
Even for testing, AWS micro instances are not sufficient IMO.
I increased the size of all the indexers (8 in my index cluster, using repFactor = 5) and all of the search-head members of my search-head cluster (3 in my SHC, using searchFactor = 3) from t1.micro to m3.medium. No change - still getting those dreaded:
Search peer ip-172-31-20-173 has the following message: Too many streaming errors to target=172.31.20.120:9998. Not rolling hot buckets on further errors to this target. (This condition might exist with other targets too. Please check the logs)
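One thing worth noting about that message: the failing target is port 9998, not the splunktcp input on 9997, and 9998 is typically a peer's replication port - the port that cluster peers stream hot-bucket copies to. Each peer declares it in server.conf, roughly like this (a sketch; the master-ip placeholder is mine, and the port must be open between all peers in the security group):

```
# server.conf on each indexer cluster peer (illustrative)
[replication_port://9998]

[clustering]
mode = slave
master_uri = https://<master-ip>:8089
```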
I also changed the pass4SymmKey in server.conf on all 8 indexers and on the search-head members. I deliberately left the index cluster master on the previous pass4SymmKey at first, to make sure the indexers had all been changed correctly: with the master still on the old key, the cluster did not form. After verifying that the cluster didn't form, I changed the cluster master's pass4SymmKey to the value on each of the 8 indexers, and the cluster then formed. This was the only way I could think of to verify that the pass4SymmKey was fully synchronized across the index cluster.
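For reference, here is where those keys live (a sketch only; the secret placeholders are mine, and the search-head cluster uses its own stanza and key, which only has to match within that cluster):

```
# server.conf on the cluster master and every indexer peer
[clustering]
pass4SymmKey = <idx-cluster-secret>

# server.conf on each search-head cluster member (and the SHC deployer)
[shclustering]
pass4SymmKey = <shc-secret>
```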
I'm really not sure where to take this now. I can certainly provide any splunkd.log records (or any other log) that anyone might suggest, either on an indexer and/or on a search-head member.
Some additional troubleshooting info: I have been analyzing the splunkd.log on one of the SHC (search-head cluster) members and found the following entries (these come after a long run of records tracing problems from before I finally realized I had forgotten to configure receiving on the indexer layer):
04-04-2015 03:27:59.122 +0000 INFO TcpOutputProc - Connected to idx=172.31.20.173:9997
04-04-2015 03:28:29.154 +0000 INFO TcpOutputProc - Connected to idx=172.31.26.200:9997
04-04-2015 03:28:59.230 +0000 INFO TcpOutputProc - Connected to idx=172.31.20.120:9997
04-04-2015 03:31:58.502 +0000 INFO TcpOutputProc - Connected to idx=172.31.18.186:9997
04-04-2015 03:32:13.735 +0000 WARN DistributedPeerManagerHeartbeat - Unable to get server info from peer: http://172.31.20.173:8089 due to: Connection refused
I'm not really sure what the root cause is when a "WARN DistributedPeerManagerHeartbeat - Unable to get server info from peer: http://172.31.18.186:8089 due to: Connection refused" occurs.
I assume that the ports are open in the AWS firewall rules? By default, they are not.
Both 8089 and 9997... plus 8000 if you are using it...
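A quick way to check those from any of the boxes, even without netcat installed, is bash's /dev/tcp redirection (a sketch; 127.0.0.1 is a stand-in - substitute each peer's private IP, and add whatever replication port your peers use):

```shell
# Probe a TCP port: /dev/tcp/<host>/<port> opens a connection from bash;
# timeout guards against packets silently dropped by a firewall rule.
probe() {
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 NOT reachable"
  fi
}

# 8089 = management, 9997 = splunktcp input, 9998 = replication (per this setup)
for port in 8089 9997 9998; do
  probe 127.0.0.1 "$port"
done
```

Running this from a search head against each indexer (and between indexers for the replication port) separates security-group problems from splunkd problems.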
Ports are open in my firewall rules. I'm going to check whether bumping up the instance type has an impact. Also, a previous reply suggested I check on the pass4SymmKey coordination within the index cluster and from the index cluster to the search-head cluster. I'm almost certain I don't have an issue in that regard, as I am generally diligent with those kinds of details, but it's easy enough to re-sync those keys, so I'll just do it.
Won't be getting to this until later this afternoon but will update this thread with the outcomes.
This is my firewall ruleset:
RDP TCP 3389 my.local.host.IP/32
Custom TCP Rule TCP 514 0.0.0.0/0
SSH TCP 22 my.local.host.IP/32
Custom TCP Rule TCP 8191 my.aws.vlan.ip/20
Custom TCP Rule TCP 8089 my.aws.vlan.ip/20
Custom UDP Rule UDP 514 0.0.0.0/0
Custom TCP Rule TCP 9997 - 9999 my.aws.vlan.ip/20
Custom TCP Rule TCP 8000 0.0.0.0/0
Custom TCP Rule TCP 5900 - 5909 my.local.host.IP/32
Custom TCP Rule TCP 5800 0.0.0.0/0
[Actual IPs obfuscated to protect the innocent - lol!]
I also don't understand the meaning and/or implications of this record:
04-04-2015 04:23:12.569 +0000 WARN DistributedPeerManager - Could not delete message id=CLUSTERREPLICATIONLINKFAILURE from peer: responsecode=500