In a properly configured and working Splunk setup that forwards a search-head cluster's internal events to an indexer cluster, what kind of log entries should appear, in which logs, and on which components of the distributed Splunk deployment?

transtrophe
Communicator

I have been trying to troubleshoot my deployment, which is not working properly. I keep receiving the message "Search peer ip-172-31-18-186 has the following message: Too many streaming errors to target=172.31.25.77:9998. Not rolling hot buckets on further errors to this target". To investigate, I have been tailing splunkd.log on both an indexer and a search-head cluster member. On the search-head cluster member, all I get in splunkd.log are these records:

04-09-2015 00:37:58.100 +0000 INFO  TcpOutputProc - Connected to idx=172.31.25.77:9998
04-09-2015 00:38:28.143 +0000 INFO  TcpOutputProc - Connected to idx=172.31.20.120:9998
04-09-2015 00:38:58.195 +0000 INFO  TcpOutputProc - Connected to idx=172.31.26.200:9998
04-09-2015 00:39:28.213 +0000 INFO  TcpOutputProc - Connected to idx=172.31.22.253:9998
04-09-2015 00:39:58.232 +0000 INFO  TcpOutputProc - Connected to idx=172.31.25.228:9998
04-09-2015 00:40:28.303 +0000 INFO  TcpOutputProc - Connected to idx=172.31.20.173:9998
04-09-2015 00:40:58.322 +0000 INFO  TcpOutputProc - Connected to idx=172.31.25.228:9998
04-09-2015 00:41:28.387 +0000 INFO  TcpOutputProc - Connected to idx=172.31.18.186:9998
04-09-2015 00:41:58.461 +0000 INFO  TcpOutputProc - Connected to idx=172.31.29.149:9998

[Note: these are the IPs of my indexers]

On an indexer, splunkd.log shows a wider variety of records. Here is a snippet:

04-09-2015 00:45:58.407 +0000 INFO  CMRepJob - job=CMReplicationErrorJob bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF failingGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 srcGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF tgtGuid=78E65DE9-B82B-4F0A-A383-D0BC1189F9A3 succeeded
04-09-2015 00:47:08.223 +0000 INFO  CMReplicationRegistry - Starting replication: bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF src=37D7692E-5D49-432E-9A6F-89C0C68FACEF target=5BB335C9-340F-42CD-A5C6-C8269429D10A
04-09-2015 00:47:08.223 +0000 INFO  BucketReplicator - event=asyncReplicateBucket bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF to guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998
04-09-2015 00:47:08.223 +0000 INFO  BucketReplicator - bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF earliest=1428291587 latest=1428291724 type=2
04-09-2015 00:47:08.223 +0000 INFO  BucketReplicator - Created asyncReplication task to replicate bucket _audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF to guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998 bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:47:08.223 +0000 INFO  BucketReplicator - event=startBucketReplication bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:47:08.223 +0000 INFO  BucketReplicator - Starting replication of bucket=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF to 172.31.22.253:9998; 
04-09-2015 00:47:08.223 +0000 INFO  BucketReplicator - Replicating warm bucket=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF node=guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998 bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:47:08.224 +0000 INFO  BucketReplicator - event=finishBucketReplication bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF [et=1428291587 lt=1428291724 type=2]
04-09-2015 00:47:08.224 +0000 INFO  BucketReplicator - event=localReplicationFinished type=warm bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:47:58.902 +0000 INFO  CMReplicationRegistry - Starting replication: bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF src=37D7692E-5D49-432E-9A6F-89C0C68FACEF target=5BB335C9-340F-42CD-A5C6-C8269429D10A
04-09-2015 00:47:58.902 +0000 INFO  BucketReplicator - event=asyncReplicateBucket bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF to guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998
04-09-2015 00:47:58.902 +0000 INFO  BucketReplicator - bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF earliest=1428289848 latest=1428291580 type=2
04-09-2015 00:47:58.902 +0000 INFO  BucketReplicator - Created asyncReplication task to replicate bucket _audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF to guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998 bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:47:58.902 +0000 INFO  BucketReplicator - event=startBucketReplication bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:47:58.902 +0000 INFO  BucketReplicator - Starting replication of bucket=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF to 172.31.22.253:9998; 
04-09-2015 00:47:58.903 +0000 INFO  BucketReplicator - Replicating warm bucket=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF node=guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998 bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:47:58.903 +0000 INFO  BucketReplicator - event=finishBucketReplication bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF [et=1428289848 lt=1428291580 type=2]
04-09-2015 00:47:58.903 +0000 INFO  BucketReplicator - event=localReplicationFinished type=warm bid=_audit~86~37D7692E-5D49-432E-9A6F-89C0C68FACEF
04-09-2015 00:48:08.228 +0000 WARN  BucketReplicator - Replication connection to ip=172.31.22.253:9998 timed out
04-09-2015 00:48:08.228 +0000 WARN  BucketReplicator - Connection failed
04-09-2015 00:48:08.228 +0000 INFO  BucketReplicator - Discarding replication data as QueueRef=guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998 bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF is deleted
04-09-2015 00:48:08.228 +0000 WARN  BucketReplicator - Failed to replicate warm bucket bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF to guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998. Connection failed
04-09-2015 00:48:08.228 +0000 INFO  CMReplicationRegistry - Finished replication: bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF src=37D7692E-5D49-432E-9A6F-89C0C68FACEF target=5BB335C9-340F-42CD-A5C6-C8269429D10A
04-09-2015 00:48:08.228 +0000 INFO  CMSlave - bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF src=37D7692E-5D49-432E-9A6F-89C0C68FACEF tgt=5BB335C9-340F-42CD-A5C6-C8269429D10A failing=5BB335C9-340F-42CD-A5C6-C8269429D10A queued replication error job
04-09-2015 00:48:08.230 +0000 INFO  CMRepJob - job=CMReplicationErrorJob bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF failingGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A srcGuid=37D7692E-5D49-432E-9A6F-89C0C68FACEF tgtGuid=5BB335C9-340F-42CD-A5C6-C8269429D10A succeeded
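Pulling the destination ports out of logs like these makes it easier to compare forwarding traffic (TcpOutputProc "Connected to idx=" lines on the search head) with replication traffic (BucketReplicator "host=... s2sport=" fields and "Replication connection ... timed out" warnings on the indexers). The Python sketch below is illustrative only: its regular expressions are assumptions derived from the message formats in the excerpts above, other Splunk versions may word these messages differently, and summarize is a hypothetical helper name, not a Splunk tool.

```python
import re
from collections import Counter

# Patterns are assumptions copied from the splunkd.log excerpts above;
# other Splunk versions may word these messages differently.
FORWARD_RE = re.compile(r"TcpOutputProc - Connected to idx=([\d.]+):(\d+)")
REPL_TARGET_RE = re.compile(r"BucketReplicator - .*host=([\d.]+) s2sport=(\d+)")
REPL_TIMEOUT_RE = re.compile(r"Replication connection to ip=([\d.]+):(\d+) timed out")


def summarize(lines):
    """Hypothetical helper: return (forwarding ports, replication ports,
    replication timeouts per host:port) found in splunkd.log lines."""
    forward_ports, repl_ports = set(), set()
    timeouts = Counter()
    for line in lines:
        if m := FORWARD_RE.search(line):
            forward_ports.add(int(m.group(2)))
        if m := REPL_TARGET_RE.search(line):
            repl_ports.add(int(m.group(2)))
        if m := REPL_TIMEOUT_RE.search(line):
            timeouts[f"{m.group(1)}:{m.group(2)}"] += 1
    return forward_ports, repl_ports, timeouts


if __name__ == "__main__":
    # Three lines taken from the excerpts above; in practice you would feed
    # in splunkd.log from the search head and from an indexer.
    sample = [
        "04-09-2015 00:37:58.100 +0000 INFO  TcpOutputProc - Connected to idx=172.31.25.77:9998",
        "04-09-2015 00:47:08.223 +0000 INFO  BucketReplicator - event=asyncReplicateBucket "
        "bid=_audit~87~37D7692E-5D49-432E-9A6F-89C0C68FACEF "
        "to guid=5BB335C9-340F-42CD-A5C6-C8269429D10A host=172.31.22.253 s2sport=9998",
        "04-09-2015 00:48:08.228 +0000 WARN  BucketReplicator - "
        "Replication connection to ip=172.31.22.253:9998 timed out",
    ]
    fwd, repl, timeouts = summarize(sample)
    print("forwarding ports:", sorted(fwd))          # [9998]
    print("replication ports:", sorted(repl))        # [9998]
    print("used for both:", sorted(fwd & repl))      # [9998]
    print("replication timeouts:", dict(timeouts))   # {'172.31.22.253:9998': 1}
```

A port that shows up in both sets, as 9998 does for these excerpts, is a good candidate for a closer look.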

esix_splunk
Splunk Employee

It looks to me like you have both your replication port and your splunktcp port configured as 9998. These need to be two different ports.

Your splunktcp port (the receiving input) is what clients, such as forwarders and your search heads, use to connect to your Splunk indexers and send them Splunk-cooked data.

Your indexer replication port has to be different. This port is defined in server.conf, or set when you enable clustering on a peer. Change the replication port across your installation to a different port, say 9890, restart, and see if that clears it up.
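To catch this on a peer before restarting, the receiving ports in inputs.conf can be compared with the replication ports in server.conf. The Python sketch below is a minimal illustration under stated assumptions: it parses the .conf text with the standard library's configparser, which copes with simple stanzas like these but not with every Splunk .conf feature; it only looks at plain splunktcp:// and replication_port:// stanzas (SSL variants are not covered); and stanza_ports and find_port_conflicts are hypothetical helper names, not Splunk tools. On a live peer, `splunk btool inputs list` and `splunk btool server list` show the merged settings you would feed in.

```python
import configparser


def stanza_ports(conf_text, prefix):
    """Return the ports of all stanzas whose name starts with `prefix`.

    Both 'splunktcp://9997' and 'splunktcp://10.0.0.5:9997' yield 9997.
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    ports = set()
    for section in parser.sections():
        if section.startswith(prefix):
            # Keep only the trailing port, dropping any optional host part.
            ports.add(int(section[len(prefix):].rsplit(":", 1)[-1]))
    return ports


def find_port_conflicts(inputs_conf, server_conf):
    """Hypothetical helper: ports a peer would open both as a splunktcp
    receiving input and as a cluster replication port."""
    receiving = stanza_ports(inputs_conf, "splunktcp://")
    replication = stanza_ports(server_conf, "replication_port://")
    return receiving & replication


if __name__ == "__main__":
    inputs_conf = "[splunktcp://9998]\ndisabled = 0\n"
    broken_server_conf = "[replication_port://9998]\n"
    fixed_server_conf = "[replication_port://9890]\n"
    print(find_port_conflicts(inputs_conf, broken_server_conf))  # {9998}
    print(find_port_conflicts(inputs_conf, fixed_server_conf))   # set()
```

After moving the replication port (9890 in the suggestion above), the same comparison on each peer should come back empty.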

transtrophe
Communicator

Thanks esix_splunk, this fixed the issue. I had been following this documentation from Best Practices: Forward Search Head internal data to Indexer Layer:

  1. Configure the search head as a forwarder. Create an outputs.conf file on the search head that configures the search head for load-balanced forwarding across the set of search peers (indexers). You must also turn off indexing on the search head, so that the search head does not both retain the data locally as well as forward it to the search peers.

Here is an example outputs.conf file:

# Turn off indexing on the search head
[indexAndForward]
index = false

[tcpout]
defaultGroup = my_search_peers
forwardedindex.filter.disable = true
indexAndForward = false

[tcpout:my_search_peers]
server=10.10.10.1:9997,10.10.10.2:9997,10.10.10.3:9997
autoLB = true

This example assumes that each indexer's receiving port is set to 9997.

For details on configuring outputs.conf, read "Configure forwarders with outputs.conf" in the Forwarding Data manual.

I'd respectfully suggest that this documentation entry explicitly point out that the receiving port used to forward the search head's internal data must be different from the indexer layer's replication port. That wasn't very clear to me, and perhaps not to others either, since I saw a few posts on this same topic. Adding that configuration note should put this issue to bed for a long, long time.
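The same rule can be checked from the search-head side: no server entry in outputs.conf should point at the indexers' replication port. Below is a minimal Python sketch, assuming a single cluster-wide replication port (9890, as in the accepted answer's example) and reading only the server key of [tcpout:...] stanzas. targets_on_port is a hypothetical helper name, and configparser only approximates Splunk's own .conf parsing.

```python
import configparser

# The outputs.conf example from the documentation excerpt quoted above.
OUTPUTS_CONF = """
# Turn off indexing on the search head
[indexAndForward]
index = false

[tcpout]
defaultGroup = my_search_peers
forwardedindex.filter.disable = true
indexAndForward = false

[tcpout:my_search_peers]
server=10.10.10.1:9997,10.10.10.2:9997,10.10.10.3:9997
autoLB = true
"""


def targets_on_port(outputs_conf, port):
    """Hypothetical helper: list tcpout server entries that use `port`."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(outputs_conf)
    hits = []
    for section in parser.sections():
        # Only target-group stanzas such as [tcpout:my_search_peers] list servers.
        if not section.startswith("tcpout:"):
            continue
        servers = parser.get(section, "server", fallback="")
        for entry in servers.split(","):
            entry = entry.strip()
            if entry and entry.rsplit(":", 1)[-1] == str(port):
                hits.append(entry)
    return hits


if __name__ == "__main__":
    # An empty list means no forwarding target collides with replication port 9890.
    print(targets_on_port(OUTPUTS_CONF, 9890))  # []
```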
