I have been trying to troubleshoot my deployment, which is not currently working properly. I keep receiving messages like:
Search peer ip-172-31-18-186 has the following message: Too many streaming errors to target=172.31.25.77:9998. Not rolling hot buckets on further errors to this target
I have been tailing splunkd.log on both an indexer and a search head cluster member. On a search head cluster member, all I am getting in splunkd.log are these records:
04-09-2015 00:37:58.100 +0000 INFO TcpOutputProc - Connected to idx=172.31.25.77:9998
04-09-2015 00:38:28.143 +0000 INFO TcpOutputProc - Connected to idx=172.31.20.120:9998
04-09-2015 00:38:58.195 +0000 INFO TcpOutputProc - Connected to idx=172.31.26.200:9998
04-09-2015 00:39:28.213 +0000 INFO TcpOutputProc - Connected to idx=172.31.22.253:9998
04-09-2015 00:39:58.232 +0000 INFO TcpOutputProc - Connected to idx=172.31.25.228:9998
04-09-2015 00:40:28.303 +0000 INFO TcpOutputProc - Connected to idx=172.31.20.173:9998
04-09-2015 00:40:58.322 +0000 INFO TcpOutputProc - Connected to idx=172.31.25.228:9998
04-09-2015 00:41:28.387 +0000 INFO TcpOutputProc - Connected to idx=172.31.18.186:9998
04-09-2015 00:41:58.461 +0000 INFO TcpOutputProc - Connected to idx=172.31.29.149:9998
[Note: these are the IPs of my indexers]
I get varied records on an indexer in its splunkd.log - here is a snippet:
04-09-2015 00:45:58.407 +0000 INFO CMRepJob - job=CMReplicationErrorJob bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 srcGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 succeeded
04-09-2015 00:47:08.223 +0000 INFO CMReplicationRegistry - Starting replication: bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF src=37D7692E-5D49-432E-9A6F-89C0C68FACEF target=5BB335C9-340F-42CD-A5C6-C8269429D10A
04-09-2015 00:47:08.223 +0000 INFO BucketReplicator - event=asyncReplicateBucket bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF to guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998
04-09-2015 00:47:08.223 +0000 INFO BucketReplicator - bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF earliest=1428291587 latest=1428291724 type=2
04-09-2015 00:47:08.223 +0000 INFO BucketReplicator - Created asyncReplication task to replicate bucket _audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF to guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998 bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:47:08.223 +0000 INFO BucketReplicator - event=startBucketReplication bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:47:08.223 +0000 INFO BucketReplicator - Starting replication of bucket=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF to 172.31.22.253:9998;
04-09-2015 00:47:08.223 +0000 INFO BucketReplicator - Replicating warm bucket=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF node=guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998 bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:47:08.224 +0000 INFO BucketReplicator - event=finishBucketReplication bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF [et=1428291587 lt=1428291724 type=2]
04-09-2015 00:47:08.224 +0000 INFO BucketReplicator - event=localReplicationFinished type=warm bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:47:58.902 +0000 INFO CMReplicationRegistry - Starting replication: bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF src=37D7692E-5D49-432E-9A6F-89C0C68FACEF target=5BB335C9-340F-42CD-A5C6-C8269429D10A
04-09-2015 00:47:58.902 +0000 INFO BucketReplicator - event=asyncReplicateBucket bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF to guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998
04-09-2015 00:47:58.902 +0000 INFO BucketReplicator - bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF earliest=1428289848 latest=1428291580 type=2
04-09-2015 00:47:58.902 +0000 INFO BucketReplicator - Created asyncReplication task to replicate bucket _audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF to guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998 bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:47:58.902 +0000 INFO BucketReplicator - event=startBucketReplication bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:47:58.902 +0000 INFO BucketReplicator - Starting replication of bucket=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF to 172.31.22.253:9998;
04-09-2015 00:47:58.903 +0000 INFO BucketReplicator - Replicating warm bucket=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF node=guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998 bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:47:58.903 +0000 INFO BucketReplicator - event=finishBucketReplication bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF [et=1428289848 lt=1428291580 type=2]
04-09-2015 00:47:58.903 +0000 INFO BucketReplicator - event=localReplicationFinished type=warm bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:48:08.228 +0000 WARN BucketReplicator - Replication connection to ip=172.31.22.253:9998 timed out
04-09-2015 00:48:08.228 +0000 WARN BucketReplicator - Connection failed
04-09-2015 00:48:08.228 +0000 INFO BucketReplicator - Discarding replication data as QueueRef=guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998 bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF is deleted
04-09-2015 00:48:08.228 +0000 WARN BucketReplicator - Failed to replicate warm bucket bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF to guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998. Connection failed
04-09-2015 00:48:08.228 +0000 INFO CMReplicationRegistry - Finished replication: bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF src=37D7692E-5D49-432E-9A6F-89C0C68FACEF target=5BB335C9-340F-42CD-A5C6-C8269429D10A
04-09-2015 00:48:08.228 +0000 INFO CMSlave - bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF src=37D7692E-5D49-432E-9A6F-89C0C68FACEF tgt=5BB335C9-340F-42CD-A5C6-C8269429D10A failing=5BB335C9-340F-42CD-A5C6-C8269429D10A queued replication error job
04-09-2015 00:48:08.230 +0000 INFO CMRepJob - job=CMReplicationErrorJob bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF failingGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A srcGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF tgtGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A succeeded
It looks to me like you have your replication port and your splunktcp input port both configured as 9998. These need to be two different ports.
Your splunktcp (input) port is what forwarders use to connect to your indexers and send their Splunk-cooked data.
Your indexer replication port has to be different. It is defined in server.conf, or set when you enable clustering on a peer. Change the replication port across your installation to a different port, say 9890, restart, and see if this clears up.
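As a rough sketch of what that separation looks like on each indexer (the port numbers 9890 and 9998 here are just examples carried over from this thread, not required values):

```ini
# server.conf on each cluster peer:
# dedicated replication port, which must NOT collide with any input port
[replication_port://9890]

# inputs.conf on each cluster peer:
# splunktcp (S2S) receiving port that forwarders and search heads send to
[splunktcp://9998]
```

With both stanzas on 9998, replication traffic and forwarder traffic fight over the same listener, which matches the connection timeouts in your BucketReplicator log lines.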
Thanks esix_splunk - this fixed the issue. I had been following the documentation under "Best Practice: Forward Search Head internal data to the Indexer Layer":
Here is an example outputs.conf file:
[indexAndForward]
index = false
[tcpout]
defaultGroup = my_search_peers
forwardedindex.filter.disable = true
indexAndForward = false
[tcpout:my_search_peers]
server=10.10.10.1:9997,10.10.10.2:9997,10.10.10.3:9997
autoLB = true
This example assumes that each indexer's receiving port is set to 9997.
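For context, the receiving side that this example assumes could be sketched like so on each indexer (again, 9997 and 9890 are only the example ports used in this thread):

```ini
# inputs.conf on each indexer:
# the receiving port that the outputs.conf example above points at
[splunktcp://9997]

# server.conf on each indexer:
# replication port, deliberately kept distinct from 9997
[replication_port://9890]
```

The key point is that the outputs.conf server list must target the splunktcp input port, never the replication port.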
For details on configuring outputs.conf, read "Configure forwarders with outputs.conf" in the Forwarding Data manual.
I'd respectfully suggest that this documentation entry specifically highlight using a port different from the index layer's replication port when forwarding the search heads' internal data. It wasn't clear to me, and judging by the several other posts on this same topic, it wasn't clear to others either; adding that configuration note should put this issue to bed for a long, long time.