I am new to Splunk and trying to troubleshoot the issue every Splunk newbie dreads: "Search peer 'xxx' has the following message: Too many streaming errors to target='target':9997. Not rolling hot buckets on further errors to this target." My distributed deployment is built as follows:
Search layer composed of a search head cluster with 3 search heads and the cluster's deployer;
Index layer composed of an indexer cluster with 5 indexers and the cluster's master node - the cluster is configured with replication factor = 5;
Forwarder layer currently deployed with 3 forwarders and a deployment server (not really doing anything with these forwarders yet until I get the issues described in this post resolved). Each peer is joined to the master in the usual way; see the sketch below.
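For reference, each idx-cluster peer is joined to the master with the standard server.conf clustering stanzas - a sketch with placeholders, and the replication port number here is just an example, not my actual value:

# server.conf on each idx-cluster peer (values are placeholders)
[clustering]
mode = slave
master_uri = https://<master-node>:8089
pass4SymmKey = <key>

[replication_port://9887]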
I have the sos app v3.2.1 deployed on the shc members (applied successfully with apply shcluster-bundle from the shc deployer) and configured to enable the two sos input scripts (lsof_sos.sh and ps_sos.sh) on each of the 3 shc members. BTW, after doing this I realized I could, and probably will, make future changes to /apps/sos/local/inputs.conf on the shc deployer for better, more time-efficient configuration management. I also deployed sideview utils v3.3.2 to the shc members the same way, although truth be told I inadvertently deployed sos BEFORE sideview utils. I didn't configure sos until after correcting that deployment faux pas (or should we call it foobar - lol).
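In case it helps, here is roughly the deployer-side override I have in mind - a minimal sketch, and the stanza names are my assumption modeled on the TA-sos stanzas shown further down (I haven't verified the sos app's exact script paths):

# $SPLUNK_HOME/etc/shcluster/apps/sos/local/inputs.conf on the deployer
# stanza names assumed; check the app's default/inputs.conf for the real ones
[script://./bin/lsof_sos.sh]
disabled = 0

[script://./bin/ps_sos.sh]
disabled = 0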
I also deployed TA-sos v2.0.5 on the 5 idx-cluster peers from the idx-cluster master node (apply cluster-bundle) and configured each idx-cluster peer to enable the two sos input scripts (lsof_sos.sh and ps_sos.sh).
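For completeness, the push from the master node went along these lines (a sketch of what I ran, with auth details elided):

# on the idx-cluster master, after staging TA-sos under $SPLUNK_HOME/etc/master-apps/
/opt/splunk/bin/splunk apply cluster-bundle --answer-yes
# verify the bundle reached all 5 peers
/opt/splunk/bin/splunk show cluster-bundle-status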
On one idx-cluster peer I have the following sets of configurations, dumped using btool:
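The dumps came from btool invocations along these lines, trimmed down to the relevant stanzas:

/opt/splunk/bin/splunk btool indexes list --debug
/opt/splunk/bin/splunk btool inputs list --debug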
From indexes.conf (filtered - not showing everything that comes from the default indexes.conf):
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf [_audit]
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf repFactor = auto
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf [_internal]
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf repFactor = auto
/opt/splunk/etc/system/default/indexes.conf [default]
/opt/splunk/etc/system/default/indexes.conf repFactor = 0
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf [main]
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf repFactor = auto
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf [sos]
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf coldPath = $SPLUNK_DB/sos/colddb
/opt/splunk/etc/system/default/indexes.conf defaultDatabase = main
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf disabled = 0
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf frozenTimePeriodInSecs = 2419200
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf homePath = $SPLUNK_DB/sos/db
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf repFactor = auto
From the idx-cluster peer's inputs.conf:
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf [script:///opt/splunk/etc/slave-apps/TA-sos/bin/lsof_sos.sh]
/opt/splunk/etc/system/default/inputs.conf _rcvbuf = 1572864
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf disabled = 0
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf host = ip-172-31-26-237
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf index = sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf interval = 600
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf source = lsof_sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf sourcetype = lsof
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf [script:///opt/splunk/etc/slave-apps/TA-sos/bin/ps_sos.sh]
/opt/splunk/etc/system/default/inputs.conf _rcvbuf = 1572864
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf disabled = 0
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf host = ip-172-31-26-237
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf index = sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf interval = 5
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf source = ps_sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf sourcetype = ps
/opt/splunk/etc/system/default/inputs.conf [splunktcp]
/opt/splunk/etc/system/default/inputs.conf _rcvbuf = 1572864
/opt/splunk/etc/system/default/inputs.conf acceptFrom = *
/opt/splunk/etc/system/default/inputs.conf connection_host = ip
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf host = ip-172-31-26-237
/opt/splunk/etc/system/default/inputs.conf index = default
/opt/splunk/etc/system/default/inputs.conf route = has_key:_replicationBucketUUID:replicationQueue;has_key:_dstrx:typingQueue;has_key:_linebreaker:indexQueue;absent_key:_linebreaker:parsingQueue
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf [splunktcp://9997]
/opt/splunk/etc/system/default/inputs.conf _rcvbuf = 1572864
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf disabled = 0
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf host = ip-172-31-26-237
/opt/splunk/etc/system/default/inputs.conf index = default
This is the configuration for ALL the idx-cluster peers.
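Since the error message names port 9997, one thing I still want to rule out is a collision between each peer's replication port and the splunktcp receiving port - a sketch of the check I have in mind:

# on each idx-cluster peer: the [replication_port://<port>] stanza in server.conf
# should not be 9997 (that's already the splunktcp receiving port)
/opt/splunk/bin/splunk btool server list --debug | grep -i replication_port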
On any given idx-cluster peer, when I analyze $SPLUNK_HOME/var/log/splunk/splunkd.log I get the following results (the log-analysis command line is shown at the top of the trace):
root@ip-172-31-26-200:/opt/splunk/var/log/splunk# grep CMStreamingErrorJob splunkd.log* | cut -d' ' -f10,12 | sort | uniq -c | sort -nr
178 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=27EE43F3-BD89-4BC4-9200-298F99B4275A
170 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A
169 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3
163 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF
12 srcGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
12 srcGuid=27EE43F3-BD89-4BC4-9200-298F99B4275A failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
11 srcGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
10 srcGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3 failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
2 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3
2 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A
1 tgtGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3 failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
1 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF
1 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=27EE43F3-BD89-4BC4-9200-298F99B4275A
1 tgtGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
1 tgtGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
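To map those GUIDs back to hosts, I've been checking each instance's instance.cfg (each Splunk instance records its own GUID there):

grep -i guid /opt/splunk/etc/instance.cfg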
If I run watch -n 1 'ls -alt' against the hot bucket directory of the sos index, I get the following snapshot in time:
Every 1.0s: ls -alt Sun Apr 5 13:19:59 2015
total 132
drwx--x--x 3 splunk splunk 4096 Apr 5 13:19 .
drwx------ 47 splunk splunk 4096 Apr 5 13:19 ..
-rw------- 1 splunk splunk 1585 Apr 5 13:19 1428239997-1428239997-1367416106440166490.tsidx
-rw------- 1 splunk splunk 1616 Apr 5 13:19 1428239992-1428239992-1367415778757555421.tsidx
-rw------- 1 splunk splunk 1617 Apr 5 13:19 1428239987-1428239987-1367415450991673577.tsidx
-rw------- 1 splunk splunk 1969 Apr 5 13:19 1428239982-1428239981-1367415123329070425.tsidx
-rw------- 1 splunk splunk 80922 Apr 5 13:19 1428239977-1428237297-1367414798428158871.tsidx
-rw------- 1 splunk splunk 291 Apr 5 13:19 Hosts.data
-rw------- 1 splunk splunk 7 Apr 5 13:19 .rawSize
-rw------- 1 splunk splunk 97 Apr 5 13:19 Sources.data
-rw------- 1 splunk splunk 97 Apr 5 13:19 SourceTypes.data
drwx------ 2 splunk splunk 4096 Apr 5 13:17 rawdata
-rw------- 1 splunk splunk 41 Apr 5 12:34 Strings.data
-rw------- 1 splunk splunk 67 Apr 5 12:34 bucket_info.csv
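From a search head I can also sanity-check the sos buckets across the peers - a sketch using the CLI, with a placeholder for credentials:

# counts sos buckets by state (hot/warm/cold) per peer
/opt/splunk/bin/splunk search '| dbinspect index=sos | stats count by state, splunk_server' -auth admin:<password>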
I've done some minor poking around with IPTraf, iotop, and netstat, and I successfully established TCP connections from each of the shc members to each of the idx-cluster peers using netcat to port 9997, as shown below.
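The netcat checks were along these lines (peer hostnames elided):

# run from each shc member against each idx-cluster peer
nc -vz <idx-peer> 9997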
I really don't know how to move forward on this troubleshooting, but I wanted to lay out everything I've done so far so that everyone can hopefully zero in on whatever it is I'm missing. I haven't yet tried adding more indexers to the cluster - I was thinking about adding 3 more to make an 8/5 deployment (8 peers over repFactor = 5) - but I think I'll wait on that until the current deployment reaches a solid operating condition (I want to control the dependent variables in this little exercise/experiment).
Thanks in advance to any and all who take the time to read my short summary of the situation - lol!