Deployment Architecture

What are recommended/effective techniques for troubleshooting replication and search factor failures?

Communicator

I am new to Splunk and trying to troubleshoot the dreaded (among Splunk newbies) "Search peer 'xxx' has the following message: Too many streaming errors to target='target':9997. Not rolling hot buckets on further errors to this target." issue on a distributed deployment built as follows:

Search layer composed of a search head cluster with 3 search heads and the cluster's deployer;
Index layer composed of an indexer cluster with 5 indexers and the cluster's master node; the cluster is configured with replication factor = 5;
Forwarder layer currently deployed with 3 forwarders and a deployment server (not yet really doing anything with these forwarders until I get the issues identified in this post resolved).

I have the sos app v3.2.1 deployed on the shc members (applied successfully using cluster-bundle from the shc deployer) and configured to enable the 2 sos input scripts (lsof_sos.sh and ps_sos.sh) on each of the 3 shc members. BTW, after doing this I realized I could, and probably will, make changes to the /apps/sos/local/inputs.conf on the shc deployer for better, more time-efficient configuration management. I also deployed sideview utils v3.3.2 to the shc members using cluster-bundle, although truth be told I inadvertently deployed the sos app BEFORE sideview utils; I didn't configure sos before correcting this deployment faux pas (or should we call that foobar - lol).

I also deployed TA-sos v2.0.5 on the 5 idx-cluster peers using the idx-cluster master node (apply cluster-bundle) and configured each idx-cluster peer to enable the 2 sos input scripts (lsof_sos.sh and ps_sos.sh).

On 1 idx-cluster peer I have the following sets of configurations, dumped using btool:

From indexes.conf (only showing settings that are not from the default indexes.conf):

/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf [_audit]
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf repFactor = auto

/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf [_internal]
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf repFactor = auto

/opt/splunk/etc/system/default/indexes.conf              [default]
/opt/splunk/etc/system/default/indexes.conf              repFactor = 0

/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf [main]
/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf repFactor = auto

/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf     [sos]
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf     coldPath = $SPLUNK_DB/sos/colddb
/opt/splunk/etc/system/default/indexes.conf              defaultDatabase = main
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf     disabled = 0
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf     frozenTimePeriodInSecs = 2419200
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf     homePath = $SPLUNK_DB/sos/db
/opt/splunk/etc/slave-apps/TA-sos/local/indexes.conf     repFactor = auto
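For reference, a dump like the one above can be produced with btool's --debug flag, which prefixes each effective setting with the conf file it came from; a small filter then drops the stock defaults. This is a sketch only, and the `non_defaults` helper name is my own, not part of Splunk:

```shell
# Hypothetical helper: keep only settings that do NOT come from the stock
# defaults file, mirroring the "only non-default configs" dump above.
non_defaults() { grep -v 'etc/system/default'; }

# On a cluster peer you would pipe btool through it (assumption: Splunk
# lives in /opt/splunk):
#   /opt/splunk/bin/splunk btool indexes list --debug | non_defaults

# Demonstration on two sample lines of btool-style output:
printf '%s\n' \
  '/opt/splunk/etc/system/default/indexes.conf              repFactor = 0' \
  '/opt/splunk/etc/slave-apps/_cluster/default/indexes.conf repFactor = auto' \
  | non_defaults
```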

From idx cluster peer's inputs:

/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    [script:///opt/splunk/etc/slave-apps/TA-sos/bin/lsof_sos.sh]
/opt/splunk/etc/system/default/inputs.conf                             _rcvbuf = 1572864
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    disabled = 0
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf                  host = ip-172-31-26-237
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    index = sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    interval = 600
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    source = lsof_sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    sourcetype = lsof

/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    [script:///opt/splunk/etc/slave-apps/TA-sos/bin/ps_sos.sh]
/opt/splunk/etc/system/default/inputs.conf                             _rcvbuf = 1572864
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    disabled = 0
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf                  host = ip-172-31-26-237
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    index = sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    interval = 5
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    source = ps_sos
/opt/splunk/etc/slave-apps/TA-sos/local/inputs.conf                    sourcetype = ps

/opt/splunk/etc/system/default/inputs.conf                             [splunktcp]
/opt/splunk/etc/system/default/inputs.conf                             _rcvbuf = 1572864
/opt/splunk/etc/system/default/inputs.conf                             acceptFrom = *
/opt/splunk/etc/system/default/inputs.conf                             connection_host = ip
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf                  host = ip-172-31-26-237
/opt/splunk/etc/system/default/inputs.conf                             index = default
/opt/splunk/etc/system/default/inputs.conf                             route = has_key:_replicationBucketUUID:replicationQueue;has_key:_dstrx:typingQueue;has_key:_linebreaker:indexQueue;absent_key:_linebreaker:parsingQueue

/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf                  [splunktcp://9997]
/opt/splunk/etc/system/default/inputs.conf                             _rcvbuf = 1572864
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf                  disabled = 0
/opt/splunk/etc/slave-apps/_cluster/local/inputs.conf                  host = ip-172-31-26-237
/opt/splunk/etc/system/default/inputs.conf                             index = default

This is the configuration for ALL the idx cluster peers.

On any given idx cluster peer, when I analyze $SPLUNK_HOME/var/log/splunk/splunkd.log I get the following results (using the command line shown in the trace below):

root@ip-172-31-26-200:/opt/splunk/var/log/splunk# grep CMStreamingErrorJob splunkd.log* | cut -d' ' -f10,12 | sort |uniq -c | sort -nr
178 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=27EE43F3-BD89-4BC4-9200-298F99B4275A
170 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A
169 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3
163 srcGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF
12 srcGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
12 srcGuid=27EE43F3-BD89-4BC4-9200-298F99B4275A failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
11 srcGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
10 srcGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3 failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
2 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3
2 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A
1 tgtGuid=93BAC151-E057-4F37-9D16-C4CAB5A971E3 failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
1 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF
1 tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 failingGuid=27EE43F3-BD89-4BC4-9200-298F99B4275A
1 tgtGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
1 tgtGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3
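As a side note, the `cut -d' ' -f10,12` in the command above depends on exact field positions, which can shift between log lines. A sketch of the same per-peer failure tally that matches the key name directly instead:

```shell
# Tally CMStreamingErrorJob events by failing peer GUID. Matching the
# failingGuid= key directly avoids depending on exact whitespace-separated
# field positions the way `cut -d' ' -f10,12` does.
tally_failing_peers() {
  grep -h CMStreamingErrorJob "$@" \
    | grep -o 'failingGuid=[0-9A-Fa-f-]*' \
    | sort | uniq -c | sort -nr
}

# Usage on an indexer:
#   tally_failing_peers /opt/splunk/var/log/splunk/splunkd.log*
```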

If I watch the sos hot bucket directory with watch -n 1 'ls -alt', I get the following snapshot in time:

Every 1.0s: ls -alt                                                                                                                          Sun Apr  5 13:19:59 2015

total 132
drwx--x--x  3 splunk splunk  4096 Apr  5 13:19 .
drwx------ 47 splunk splunk  4096 Apr  5 13:19 ..
-rw-------  1 splunk splunk  1585 Apr  5 13:19 1428239997-1428239997-1367416106440166490.tsidx
-rw-------  1 splunk splunk  1616 Apr  5 13:19 1428239992-1428239992-1367415778757555421.tsidx
-rw-------  1 splunk splunk  1617 Apr  5 13:19 1428239987-1428239987-1367415450991673577.tsidx
-rw-------  1 splunk splunk  1969 Apr  5 13:19 1428239982-1428239981-1367415123329070425.tsidx
-rw-------  1 splunk splunk 80922 Apr  5 13:19 1428239977-1428237297-1367414798428158871.tsidx
-rw-------  1 splunk splunk   291 Apr  5 13:19 Hosts.data
-rw-------  1 splunk splunk     7 Apr  5 13:19 .rawSize
-rw-------  1 splunk splunk    97 Apr  5 13:19 Sources.data
-rw-------  1 splunk splunk    97 Apr  5 13:19 SourceTypes.data
drwx------  2 splunk splunk  4096 Apr  5 13:17 rawdata
-rw-------  1 splunk splunk    41 Apr  5 12:34 Strings.data
-rw-------  1 splunk splunk    67 Apr  5 12:34 bucket_info.csv

I've done some minor poking about with iptraf, iotop, and netstat, and successfully established TCP connections from each of the shc members to each of the idx cluster peers using netcat to port 9997.
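To extend that netcat check to the management port as well, a loop like the following probes both 8089 (management) and 9997 (ingest/replication) on every peer. The IPs are the ones appearing in this thread and the 2-second timeout is an arbitrary choice:

```shell
# Probe a TCP port using bash's /dev/tcp (no netcat dependency);
# `timeout` keeps unreachable hosts from hanging the loop.
check_port() {
  if timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 open"
  else
    echo "$1:$2 CLOSED"
  fi
}

for peer in 172.31.18.186 172.31.20.120 172.31.26.200 172.31.20.173; do
  for port in 8089 9997; do
    check_port "$peer" "$port"
  done
done
```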

I really don't know how to go forward on this troubleshooting, but wanted to share everything I've done so everyone could hopefully zero in on whatever it is I am missing here. I haven't yet tried adding additional indexers to the cluster - I was thinking about adding 3 more to make an 8/5 deployment (8 peers over repFactor = 5), but I think I'm going to wait on that until I get the current deployment into a solid operating condition (I want to control the dependent variables in this little exercise/experiment).

Thanks in advance to any and all that take the time to read my short summary of the situation - lol!

Splunk Employee

Your common point of failure in the logs is peer=78E65DE9...

Can you ensure all your indexers are able to communicate with that specific indexer?


Path Finder

If you have pass4SymmKey defined in both the [general] stanza and the [clustering] stanza, the encrypted values must also match; if they don't, this setup will fail.

To avoid this, I recommend that you always set the pass4SymmKey only in the [general] stanza for all instances. See link below for more info.

http://docs.splunk.com/Documentation/Splunk/6.2.5/Indexer/Enableclustersindetail#Configure_the_secre...
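For illustration, keeping the shared secret only in [general] would look like this in server.conf on each instance; the secret here is a placeholder and the master_uri is the one appearing in this thread:

```ini
# server.conf -- sketch only; set the same secret on every instance.
# With no pass4SymmKey under [clustering], the [general] value is used,
# which avoids a mismatch between the two stanzas.
[general]
pass4SymmKey = <your_shared_secret>

[clustering]
mode = slave
master_uri = https://ip-172-31-26-237:8089
```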

Splunk Employee

If you remember the pass4SymmKey value, the quickest fix is to modify server.conf and restart the server.
Steps:
1. Navigate to $SPLUNK_HOME/etc/system/local.
2. Open the server.conf file.
3. Find the relevant stanza; there might be different ones, e.g. [clustering] or [clustermaster]. Set the secret value:
pass4SymmKey = < >
4. Save changes.
5. Restart Splunk.

Communicator

Still stuck on this issue, i.e. no resolution. Splunk Web for a cluster index peer, the cluster master, or any search head cluster member throws ongoing error messages about too many streaming errors, so no replication. Oddly, my _internal, _audit, and sos indexes continue to grow in terms of number of events, but trying to use SoS views (for example, resource usage of an index cluster peer or search head cluster member) still results in the previously mentioned error about no search results being available.

Can't really move forward on my project, unfortunately.


SplunkTrust

As per the logs below, I think you have set different pass4SymmKey values on the indexer and the cluster master:

04-04-2015 04:37:50.847 +0000 WARN CMMasterProxy - Master is down! Make sure pass4SymmKey is matching if master is running.
04-04-2015 04:37:50.847 +0000 ERROR ClusteringMgr - VerifyMultisiteConfig failed Error=failed method=GET path=/services/cluster/master/info/?output_mode=json master=ip-172-31-26-237:8089 rv=0 actualresponsecode=502 expectedresponsecode=200 status_line=Error connecting: Connection refused error=Connection refused

Also, please check port 8089 connectivity between the CM, SH, and IDX tiers.
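One quick way to test both the port and the management endpoint in one shot is a curl probe against the master's cluster info endpoint (hostname and port taken from the log above; credentials omitted, since even an authentication error proves the port is reachable). A sketch:

```shell
# Probe the cluster master's management endpoint. Any HTTP response
# (even a 401) exits curl with status 0, meaning port 8089 is reachable;
# "unreachable" means the TCP connection itself failed, as in the log above.
master_check() {
  local host="$1" port="${2:-8089}"
  if curl -sk --max-time 3 -o /dev/null \
      "https://$host:$port/services/cluster/master/info?output_mode=json"; then
    echo "master $host:$port reachable"
  else
    echo "master $host:$port unreachable"
  fi
}

master_check ip-172-31-26-237
```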



Communicator

I'll check on this, but it will be later this afternoon. I will likely just change the pass4SymmKey value in a limited version of server.conf (i.e. just the pass4SymmKey element in the [clustering] stanza) in both the shcluster deployed-bundle location and the cluster master bundle location, then execute the applicable bundle apply for each cluster.

Will update the post when this is completed.


Communicator

For the moment, though, I will mention that I can and do execute successful bundle pushes from the index cluster master to all 8 of my index cluster peers. I also see, on any search head, ongoing splunkd.log entries showing that the instance I'm tailing is making ongoing connections to each of the 8 indexers. However, those are now the ONLY entries appearing in splunkd.log, which doesn't seem like normal or expected behavior.


Legend

server.conf would be an interesting conf file to examine as well, on all the search heads, the master, and the indexers.


Communicator

I do spend a lot of time looking over server.conf. Anything in particular you suggest looking for as a possible contributor to this hot-bucket-replication/too-many-streaming-errors condition occurring in my deployment? I am only forwarding the search head data from _internal and _audit to the indexer layer, and I can't really get a handle on why that would cause a too-many-streaming-errors condition.


Communicator

Some more possibly important records:

04-04-2015 04:34:27.934 +0000 INFO TcpOutputProc - Connected to idx=172.31.18.186:9997
04-04-2015 04:34:58.007 +0000 INFO TcpOutputProc - Connected to idx=172.31.20.120:9997
04-04-2015 04:35:28.050 +0000 INFO TcpOutputProc - Connected to idx=172.31.26.200:9997
04-04-2015 04:35:58.089 +0000 INFO TcpOutputProc - Connected to idx=172.31.20.173:9997
04-04-2015 04:36:58.151 +0000 INFO TcpOutputProc - Connected to idx=172.31.20.120:9997
04-04-2015 04:37:47.871 +0000 INFO PipelineComponent - Performing early shutdown tasks
04-04-2015 04:37:47.871 +0000 INFO IndexProcessor - handleSignal : Disabling streaming searches.
04-04-2015 04:37:47.871 +0000 INFO IndexProcessor - request state change from=RUN to=SHUTDOWNSIGNALED
04-04-2015 04:37:50.847 +0000 WARN CMMasterProxy - Master is down! Make sure pass4SymmKey is matching if master is running.
04-04-2015 04:37:50.847 +0000 ERROR ClusteringMgr - VerifyMultisiteConfig failed Error=failed method=GET path=/services/cluster/master/info/?output_mode=json master=ip-172-31-26-237:8089 rv=0 actualresponsecode=502 expectedresponsecode=200 status_line=Error connecting: Connection refused error=Connection refused
04-04-2015 04:37:50.847 +0000 ERROR GenerationGrabber - The searchhead is unable to update the peer information. Error = 'Unable to reach master' for master=https://ip-172-31-26-237:8089.


Communicator

I also don't understand the meaning and/or implications of this record:

04-04-2015 04:23:12.569 +0000 WARN DistributedPeerManager - Could not delete message id=CLUSTERREPLICATIONLINKFAILURE from peer: responsecode=500


Communicator

Some additional troubleshooting info. I have been analyzing the splunkd.log on one of the shc (search-head cluster) members and found the following entries (these come after a long run of records tracing problems logged before I finally realized I had forgotten to configure receiving on the indexer layer):

04-04-2015 03:27:59.122 +0000 INFO TcpOutputProc - Connected to idx=172.31.20.173:9997
04-04-2015 03:28:29.154 +0000 INFO TcpOutputProc - Connected to idx=172.31.26.200:9997
04-04-2015 03:28:59.230 +0000 INFO TcpOutputProc - Connected to idx=172.31.20.120:9997
04-04-2015 03:31:58.502 +0000 INFO TcpOutputProc - Connected to idx=172.31.18.186:9997
04-04-2015 03:32:13.735 +0000 WARN DistributedPeerManagerHeartbeat - Unable to get server info from peer: http://172.31.20.173:8089 due to: Connection refused

I'm not really sure what the root-cause problems are when a "WARN DistributedPeerManagerHeartbeat - Unable to get server info from peer: http://172.31.18.186:8089 due to: Connection refused" occurs.


Legend

I assume that the ports are open in the AWS firewall rules? By default, they are not.
Both 8089 and 9997... plus 8000 if you are using it...


Communicator

This is my firewall ruleset:

RDP TCP 3389 my.local.host.IP/32
Custom TCP Rule TCP 514 0.0.0.0/0
SSH TCP 22 my.local.host.IP/32
Custom TCP Rule TCP 8191 my.aws.vlan.ip/20
Custom TCP Rule TCP 8089 my.aws.vlan.ip/20
Custom UDP Rule UDP 514 0.0.0.0/0
Custom TCP Rule TCP 9997 - 9999 my.aws.vlan.ip/20
Custom TCP Rule TCP 8000 0.0.0.0/0
Custom TCP Rule TCP 5900 - 5909 my.local.host.IP/32
Custom TCP Rule TCP 5800 0.0.0.0/0

[Actual IPs obfuscated to protect the innocent - lol!]


Communicator

Ports are open in my firewall rules. I'm going to check whether bumping up the instance type has an impact. Also, a previous reply suggested I check the pass4SymmKey coordination within the index cluster and between the index cluster and the search-head cluster. I'm almost certain I don't have an issue there, as I am generally diligent with those kinds of details, but it's easy enough to resync those keys so I'll just do it.

Won't be getting to this until later this afternoon but will update this thread with the outcomes.


Communicator

Also, I am running ALL hosts in AWS EC2 using the debian-wheezy-amd64-pvm-2015-01-28-ebs AMI as t1.micro instances, with 15GB of storage on the shc members and the 4 forwarders (a previous post said 3 forwarders - I actually have 4; not that this is a likely part of the issue) and 20GB on the index cluster members. I don't know whether the t1.micro instance type is involved, so I wanted to raise it as a possible factor.


Legend

The micro instance can DEFINITELY be a factor. AWS micro instances have very limited network bandwidth; AWS calls it "very low." So it is entirely possible that the bucket replication is failing because of poor/sporadic network performance. This can also affect forwarding, etc.

Spin up a small set of larger instances and see if the problem persists. In AWS, I use CentOS on either m1.medium or c3.large; these instances have "moderate" network performance. For production, you might want to choose a size that has even better network performance.

Even for testing, AWS micro instances are not sufficient IMO.


Communicator

I increased the size of all the indexers (8 in my index cluster using repFactor = 5) and all the search-head members of my search-head cluster (3 in my shc using searchFactor = 3) from t1.micro to m3.medium. No change - still getting those dreaded:

Search peer ip-172-31-20-173 has the following message: Too many streaming errors to target=172.31.20.120:9998. Not rolling hot buckets on further errors to this target. (This condition might exist with other targets too. Please check the logs)

Also changed the pass4SymmKey in each server.conf on all 8 indexers and the search-head members. I even did this by leaving the index cluster master with the previous value for pass4SymmKey to make sure that the indexers all got changed correctly. Leaving the master with the previous pass4SymmKey value resulted in the cluster not forming. After verifying that the cluster didn't form, I then made the change of the cluster master's pass4SymmKey to the values on each of the 8 indexers and the cluster then formed. This was the only way I could think of to verify that the pass4SymmKey was fully synchronized across the index cluster.

Really not sure where to take this now. I can certainly provide any splunkd.log records (or any other log) that anyone might suggest (either on an indexer and/or on a search-head member).


Communicator

On the search head cluster, when I run SoS from any member, I get the following info message when I execute the resource-utilization search for any monitored resource:

search index=sos sourcetype="ps" host="ip-172-31-23-108" | multikv | eval type=case(like(ARGS, "%search%"),"searches",like(ARGS, "%root.py%start%") OR like(COMMAND, "%splunkweb%") OR (like(COMMAND,"%python%") AND like(ARGS,"%appserver%")), "Splunk Web",like(ARGS,"%-p%start%") OR (like(COMMAND,"%splunkd%") AND like(ARGS, "service")),"splunkd server") | eval RSZMB=RSZKB/1024 | eval VSZMB=VSZKB/1024 | bin time span=5s | stats first(pctCPU) AS pctCPU, first(RSZMB) AS RSZMB, first(VSZMB) AS VSZMB first(type) AS type by PID _time | stats sum(pctCPU) AS pctCPU, sum(RSZMB) AS RSZMB, sum(VSZMB) AS VSZMB by type, _time | bin _time span=4h | sistats avg(pctCPU), median(pctCPU), median(RSZMB), median(VSZMB) by type, _time
over the time range: (earliest indexed event) – 1/1/01 12:00:00.000 AM
did not return any data. Possible solutions are to:
relax the primary search criteria
widen the time range of the search
check that the default search indexes for your account include the desired indexes
The following messages were returned by the search subsystem:
WARN: Search filters specified using splunkserver/splunkserver_group do not match any search peer.
