Deployment Architecture

What are recommended/effective techniques for troubleshooting replication and search factor failures?

transtrophe
Communicator

I am new to Splunk and am trying to troubleshoot the issue dreaded by Splunk newbies, "Search peer 'xxx' has the following message: Too many streaming errors to target='target':9997. Not rolling hot buckets on further errors to this target.", on a distributed deployment built as follows:

Search layer composed of a search head cluster with 3 search heads plus the cluster's deployer;
Index layer composed of an indexer cluster with 5 indexers plus the cluster's master node; the cluster is configured with replication factor = 5 (rough config sketch after this list);
Forwarder layer currently deployed with 3 forwarders and a deployment server (I'm not really doing anything with these forwarders until I get the issues described in this post resolved).
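
For reference, this is roughly what the clustering config looks like in server.conf on the master node and on each idx-cluster peer. Only replication_factor = 5 is something I set explicitly; the search_factor, pass4SymmKey and replication port values below are placeholders/assumptions, not copied from my boxes:

# server.conf on the cluster master node (sketch)
[clustering]
mode = master
replication_factor = 5
search_factor = 2                # assumption - I have not confirmed this value
pass4SymmKey = <cluster_key>     # placeholder

# server.conf on each idx-cluster peer (sketch)
[clustering]
mode = slave
master_uri = https://<master-node>:8089
pass4SymmKey = <cluster_key>     # placeholder

[replication_port://9887]        # placeholder port - use whatever the peers are actually configured with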

I have the sos app v3.2.1 deployed on the shc members (applied successfully by pushing the bundle from the shc deployer; a rough command sketch is below) and configured to enable the 2 sos input scripts (lsof_sos.sh and ps_sos.sh) on each of the 3 shc members. BTW, after doing this I realized I could, and probably will, make changes to the /apps/sos/local/inputs.conf on the shc deployer for better, more time-efficient configuration management. I also deployed sideview utils v3.3.2 to the shc members the same way, although truth be told I inadvertently deployed the sos app BEFORE sideview utils; I didn't configure sos before correcting this deployment faux pas (or should we call that a foobar - lol).
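
In case it matters, the push from the shc deployer was done along these lines (host and credentials are placeholders; the apps sit under $SPLUNK_HOME/etc/shcluster/apps on the deployer):

/opt/splunk/bin/splunk apply shcluster-bundle -target https://<any-shc-member>:8089 -auth admin:<password>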

I also deployed TA-sos v2.0.5 on the 5 idx-cluster peers using the idx-cluster master node (apply cluster-bundle; sketch below) and configured each idx-cluster peer to enable the 2 sos input scripts (lsof_sos.sh and ps_sos.sh).
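
Roughly, the push from the master node looked like this (credentials are placeholders; the TA sits under $SPLUNK_HOME/etc/master-apps on the master), and the second command is what I use to confirm all 5 peers report the same active bundle:

/opt/splunk/bin/splunk apply cluster-bundle --answer-yes -auth admin:<password>
/opt/splunk/bin/splunk show cluster-bundle-status -auth admin:<password>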

On 1 idx-cluster peer I have the following sets of configurations, dumped using btool (the exact commands are shown just below):
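
These are the btool invocations behind the dumps that follow (run on the peer; the --debug flag is what prints the contributing file in front of each setting):

/opt/splunk/bin/splunk btool indexes list --debug
/opt/splunk/bin/splunk btool inputs list --debug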

From indexes.conf (only showing configs that are not from the default indexes.conf):

/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf [_audit]
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf repFactor = auto

/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf [_internal]
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf repFactor = auto

/opt/splunk/etc/system/default/indexes.conf              [default]
/opt/splunk/etc/system/default/indexes.conf              repFactor = 0

/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf [main]
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf repFactor = auto

/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf     [sos]
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf     coldPath = $SPLUNK_DB/sos/colddb
/opt/splunk/etc/system/default/indexes.conf              defaultDatabase = main
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf     disabled = 0
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf     frozenTimePeriodInSecs = 2419200
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf     homePath = $SPLUNK_DB/sos/db
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf     repFactor = auto

From the idx cluster peer's inputs.conf:

/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    [script:///opt/splunk/etc/slave-apps/TA-sos/bin/lsof_sos.sh]
/opt/splunk/etc/system/default/inputs.conf                             _rcvbuf = 1572864
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    disabled = 0
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf                  host = ip-172-31-26-237
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    index = sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    interval = 600
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    source = lsof_sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    sourcetype = lsof

/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    [script:///opt/splunk/etc/slave-apps/TA-sos/bin/ps_sos.sh]
/opt/splunk/etc/system/default/inputs.conf                             _rcvbuf = 1572864
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    disabled = 0
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf                  host = ip-172-31-26-237
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    index = sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    interval = 5
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    source = ps_sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    sourcetype = ps

/opt/splunk/etc/system/default/inputs.conf                             [splunktcp]
/opt/splunk/etc/system/default/inputs.conf                             _rcvbuf = 1572864
/opt/splunk/etc/system/default/inputs.conf                             acceptFrom = *
/opt/splunk/etc/system/default/inputs.conf                             connection_host = ip
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf                  host = ip-172-31-26-237
/opt/splunk/etc/system/default/inputs.conf                             index = default
/opt/splunk/etc/system/default/inputs.conf                             route = has_key:_replicationBucketUUID:replicationQueue;has_key:_dstrx:typingQueue;has_key:_linebreaker:indexQueue;absent_key:_linebreaker:parsingQueue

/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf                  [splunktcp://9997]
/opt/splunk/etc/system/default/inputs.conf                             _rcvbuf = 1572864
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf                  disabled = 0
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf                  host = ip-172-31-26-237
/opt/splunk/etc/system/default/inputs.conf                             index = default

This is the configuration for ALL the idx cluster peers.

When I analyze $SPLUNK_HOME/var/log/splunk/splunkd.log on any given idx cluster peer, I get the following results (the log analysis command line I used is shown in the trace):

root@ip-172-31-26-200:/opt/splunk/var/log/splunk# grep CMStreamingErrorJob splunkd.log* | cut -d' ' -f10,12 | sort |uniq -c | sort -nr
178 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=27EE43F3-BD89-4BC4-9200-298F99B4275A
170 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A
169 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3
163 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF
12 srcGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
12 srcGuid=27EE43F3-BD89-4BC4-9200-298F99B4275A failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
11 srcGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
10 srcGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3 failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
2 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3
2 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A
1 tgtGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3 failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
1 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF
1 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=27EE43F3-BD89-4BC4-9200-298F99B4275A
1 tgtGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
1 tgtGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
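
As a follow-up I plan to map those GUIDs back to peer names with something like the first search below, run from a search head against the master (the splunk_server value is a placeholder for however the master is known to my search heads). The second search is a rough SPL equivalent of the grep above, assuming the srcGuid/failingGuid key=value pairs get auto-extracted at search time:

| rest /services/cluster/master/peers splunk_server=<master-node> | table title label status

index=_internal sourcetype=splunkd CMStreamingErrorJob | stats count by srcGuid, failingGuid | sort - count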

If I watch -n 1 'ls -alt' the hot bucket directory for the sos index, I get the following as a snapshot in time:

Every 1.0s: ls -alt                                                                                                                          Sun Apr  5 13:19:59 2015

total 132
drwx--x--x  3 splunk splunk  4096 Apr  5 13:19 .
drwx------ 47 splunk splunk  4096 Apr  5 13:19 ..
-rw-------  1 splunk splunk  1585 Apr  5 13:19 1428239997-1428239997-1367416106440166490.tsidx
-rw-------  1 splunk splunk  1616 Apr  5 13:19 1428239992-1428239992-1367415778757555421.tsidx
-rw-------  1 splunk splunk  1617 Apr  5 13:19 1428239987-1428239987-1367415450991673577.tsidx
-rw-------  1 splunk splunk  1969 Apr  5 13:19 1428239982-1428239981-1367415123329070425.tsidx
-rw-------  1 splunk splunk 80922 Apr  5 13:19 1428239977-1428237297-1367414798428158871.tsidx
-rw-------  1 splunk splunk   291 Apr  5 13:19 Hosts.data
-rw-------  1 splunk splunk     7 Apr  5 13:19 .rawSize
-rw-------  1 splunk splunk    97 Apr  5 13:19 Sources.data
-rw-------  1 splunk splunk    97 Apr  5 13:19 SourceTypes.data
drwx------  2 splunk splunk  4096 Apr  5 13:17 rawdata
-rw-------  1 splunk splunk    41 Apr  5 12:34 Strings.data
-rw-------  1 splunk splunk    67 Apr  5 12:34 bucket_info.csv
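
I was also thinking of sanity-checking the sos buckets from the search layer with something along these lines (just a sketch; I haven't confirmed which dbinspect fields are most useful here):

| dbinspect index=sos | stats count by splunk_server, state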

I've done some minor poking around using IPTraf, iotop, and netstat, and have successfully established TCP connections from each of the shc members to each of the idx cluster peers using netcat to port 9997 (sketch below).
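
The connectivity test was basically the first command below; I suppose the next thing to try is the peer-to-peer check against each peer's replication port, i.e. whatever is set in its [replication_port://...] stanza in server.conf (9887 below is just a placeholder):

# from each shc member to each idx-cluster peer (this worked)
nc -vz <idx-peer> 9997

# from each idx-cluster peer to every other peer's replication port (placeholder port)
nc -vz <other-idx-peer> 9887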

I really don't know how to go forward with this troubleshooting, but I wanted to share everything I've done so far so that everyone could hopefully zero in on whatever it is I am missing. I haven't yet tried adding more indexers to the cluster - I was thinking about adding 3 more to make it an 8 / 5 deployment (8 peers over repFactor = 5), but I think I'm going to wait on that until I get the current deployment into a solid operating condition (I want to control the dependent variables in this little exercise/experiment).

Thanks in advance to any and all that take the time to read my short summary of the situation - lol!
