We're developing a multi-tenant Splunk design using multiple server IPs rather than ports. This has worked great but we're hitting a problem now that we're introducing clustering. Here are some characteristics of the environment:
OS: RHEL6 x64
Splunk Version: 5.0.1
Indexer 1 (idx01a): Server IP 192.168.1.10, Splunk IP 192.168.1.11 (mgmt port 8089, replication port 9887)
Indexer 2 (idx01b): Server IP 192.168.1.20, Splunk IP 192.168.1.21 (mgmt port 8089, replication port 9887)
Cluster Master (mst01a): Server IP 192.168.1.30, Splunk IP 192.168.1.31 (mgmt port 8089)
To be clear, each of these instances runs on a different host but we expect to run multiple masters on a single host and multiple indexers on a single host to meet flexibility and security goals. When indexer 1 is configured to point to the master, messages like this are generated on the master:
01-23-2013 12:09:36.724 -0600 INFO ClusterMasterPeerHandler - Add peer info replication_address=192.168.1.10 forwarder_address= search_address= mgmtPort=8089 rawPort=9887 useSSL=false forwarderPort=0 forwarderPortUseSSL=true serverName=idx01a.example.com activeBundleId=9a21b27018937a8e281212a39f2b262f status=Up type=Initial-Add baseGen=0 01-23-2013 12:09:36.725 -0600 ERROR ClusterMasterPeerHandler - Cannot add peer=192.168.1.10 mgmtport=8089 (reason: send failure peer=https://192.168.1.10:8089 rc=2)
Note that the master is trying to send traffic to the server IP rather than the Splunk IP. On the indexers, we've set the HOSTNAME and SPLUNK_BINDIP variables in splunk-launch.conf as well as the mgmtHostPort in web.conf and serverName in server.conf. It looks like the indexer is either telling the master about its "real" hostname, or the master assumes that indexer traffic should be returned to the IP where it originated, regardless of hostname.
We're early enough in implementation that we could abandon the multiple IP design in favor of a single IP, multiple port approach (I confirmed that the cluster forms properly with this approach) but there are several benefits to the current design. Are there any known workarounds for this behavior or is our best bet to redesign around a single IP per host (are we likely to hit other problems down the road)?