Re: Getting an Error setting up Clustering

rmcdougal · ‎11-12-2012

I am attempting to create a cluster, but I receive an error when I try to add a peer (any peer, for that matter). My setup is one VM serving as the master node and two physical indexers. The replication factor is set to 2 and the search factor is also 2. When I try to add one of the physical indexers to the cluster, this is the message I receive:

failed to register with cluster master reason: failed method=POST path=/services/cluster/master/peers master=https://VM-SplunkMN:8089 rv=0 actual_response_code=500 expected_response_code=201 status_line=HTTP/1.1 500 Internal Server Error [ event=addPeer status=retrying replication_address= forwarder_address= search_address= mgmtPort=8089 rawPort=9913 useSSL=false forwarderPort=9910 forwarderPortUseSSL=false serverName=SPLUNK1 activeBundleId=9d924c537e9dea196053cd549f82fbbd status=Up type=Initial-Add baseGen=0 ]

The only thing that stands out to me is the forwarderPort=9910 entry. That is the port I use for server forwarders; I am not sure why it would show up here.

thambisetty · ‎07-07-2023

I faced the exact same issue in one of the multi-site indexer clusters when I upgraded the indexer cluster from version 9.0.x to 9.0.5.

After upgrading Splunk on the indexer, the virtual machine (VM) running the indexer unexpectedly went down. When I restarted the VM, I discovered that the Splunk service was already running, and the version displayed was the latest one. However, I failed to notice that it was experiencing problems connecting to the cluster manager.

I completed the upgrade, but after a few days (around 15 days), the vulnerability management team requested another Splunk version upgrade. When I checked the Splunk version using the command, it displayed version 9.0.5. However, upon inspecting the $SPLUNK_HOME/etc/splunk.version file, I found that it still had the old version, indicating an unsuccessful upgrade.

Realizing this, I put the cluster master in maintenance mode, stopped the Splunk service on the faulty indexer, and converted the standalone buckets using the commands below. Unfortunately, while restarting the Splunk service on the faulty indexer, the server went down again.

# Find standalone buckets
find $SPLUNK_DB/ -type d -name "db*" | grep -P "db_\d*_\d*_\d*$"
# Convert standalone buckets to clustered buckets by appending a GUID to the directory name
# 5A0E298B-0AFB-4d56-9dD0-A64dfdfd19DA8 is the GUID of the cluster manager (master)
find $SPLUNK_DB/ -type d -name "db_*" | grep -P "db_\d*_\d*_\d*$" | xargs -I {} mv {} {}_5A0E298B-0AFB-4d56-9dD0-A64dfdfd19DA8

I repeated this process two to three times, but it did not resolve the issue.

Finally, I cleared the $SPLUNK_HOME/etc/instance.cfg file on the faulty indexer and restarted the service. This time, the indexer successfully joined the cluster.

joechakkola1 · ‎11-09-2018

I had similar errors. I was able to resolve them by changing the replication port number. The issue was that I had used the same port (9997) for both replication and receiving. After I dedicated port 9887 to replication in server.conf ([replication_port://9887]) and restarted the indexers and the cluster master, the issue was resolved.
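
A quick way to spot this kind of clash is to compare the stanza headers in the two files. The sketch below is only an illustration, not a Splunk tool: the function names stanza_ports and ports_conflict are made up here, and it assumes the replication port is declared as [replication_port://<port>] in server.conf (as in this post) and the receiving port as a plain [splunktcp://<port>] stanza in inputs.conf.

```shell
#!/bin/sh
# Print the port number from every "[<prefix>://<port>]" stanza header in a file.
# An optional "<host>:" part in front of the port is ignored.
stanza_ports() {
    prefix=$1
    file=$2
    sed -nE "s#^\[${prefix}://(.*:)?([0-9]+)\].*#\2#p" "$file"
}

# Succeed (exit 0) if any replication port in server.conf is also used as a
# splunktcp receiving port in inputs.conf; fail (exit 1) otherwise.
ports_conflict() {
    server_conf=$1
    inputs_conf=$2
    for rport in $(stanza_ports replication_port "$server_conf"); do
        for iport in $(stanza_ports splunktcp "$inputs_conf"); do
            if [ "$rport" = "$iport" ]; then
                echo "conflict: port $rport is used for both replication and receiving"
                return 0
            fi
        done
    done
    return 1
}
```

It could be called as ports_conflict "$SPLUNK_HOME/etc/system/local/server.conf" "$SPLUNK_HOME/etc/system/local/inputs.conf", assuming both settings live under etc/system/local; in many deployments they sit in an app directory instead, so adjust the paths accordingly.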

schose · ‎07-27-2016

Hi all,

We had the same issue, caused by a faulty bucket. The bucket name was shown under Messages in the web GUI. We moved the bucket and ran splunk fsck, which solved the issue.

Cheers,

Andreas

greich · ‎01-22-2014

I have the same issue on an indexer that had to be taken out of the cluster for a while and is now trying to rejoin.
I did touch instance.cfg, which contains the same value as displayed on the cluster master.

svenwendler · ‎03-20-2013

I received the same error and I had no connectivity issues.

My chosen method of distribution was to install one instance and then copy the binaries to the other servers.

I changed the server name in etc/system/local/server.conf,
but I missed something else (more on this later).

I had a hunch that it was something to do with an id of the server so I went ahead and installed splunk on each of the servers one by one.

I created the cluster again and had no problems.

Further investigation into why it didn't work led me to:
/proj/splunk/splunk/etc/instance.cfg
If I had changed the guid in that file to something unique on each server, then I reckon it would have worked.
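
To confirm that copied installations share an identity, the guid values can be compared directly. This is a minimal sketch, assuming instance.cfg holds a single "guid = <value>" line and that each server's file has been copied somewhere it can be read side by side; read_guid and same_guid are hypothetical helper names, not Splunk commands.

```shell
#!/bin/sh
# Print the GUID stored in an instance.cfg file (the first "guid = ..." line).
read_guid() {
    sed -nE 's/^[[:space:]]*guid[[:space:]]*=[[:space:]]*([^[:space:]]+).*$/\1/p' "$1" | head -n 1
}

# Succeed (exit 0) if two instance.cfg files carry the same non-empty GUID.
same_guid() {
    guid_a=$(read_guid "$1")
    guid_b=$(read_guid "$2")
    [ -n "$guid_a" ] && [ "$guid_a" = "$guid_b" ]
}
```

For example, same_guid master_instance.cfg peer_instance.cfg && echo "duplicate GUID" would flag a peer that was cloned from the master (both file names are placeholders).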

goelt2000 · ‎02-24-2020

Thanks! I had the same issue. I was using Amazon AMIs to launch an indexer cluster comprising 3 Peer Indexers, 1 Master Indexer and 1 Search Head. Your answer resolved my issue.
However, I see that the master indexer node has two search heads and is registering itself as a search head too, in addition to the one I configured separately.

ghendrey_splunk · ‎05-18-2016

In the search UI you will see a system message like this:
Failed to add peer 'guid=02E2B503-8C98-4690-BD9C-ABAB937BDAE4 server name=indexpeer ip=192.168.1.69:8089' to the master. Error=Cannot register a peer with the master's guid.

You are correct: the two systems have the same guid in instance.cfg, and that must be causing the problem.

The REST endpoint (/services/cluster/master/peers) should be returning a meaningful error message, and it is not. So if you are debugging this on the slave, all you see is this:
"05-18-2016 16:45:22.102 -0700 WARN CMSlave - Failed to register with cluster master reason: failed method=POST path=/services/cluster/master/peers/?output_mode=json master=ghendrey-mbp.local:8092 rv=0 actual_response_code=500 expected_response_code=201 status_line=Internal Server Error error=No error [ event=addPeer status=retrying AddPeerRequest: { _id= active_bundle_id=488D0EABB38D6873F00907580854C72D add_type=Initial-Add base_generation_id=0 latest_bundle_id=488D0EABB38D6873F00907580854C72D mgmt_port=8089 name=02E2B503-8C98-4690-BD9C-ABAB937BDAE4 register_forwarder_address= register_replication_address= register_search_address= replication_port=34572 replication_use_ssl=0 replications= server_name=indexpeer site=default splunk_version=6.4.0 splunkd_build_number=dbd9c8b7bedfe28e2ed0a9140fca47225309167a status=Up } ]."

ghendrey_splunk · ‎05-18-2016

I deleted the GUID from instance.cfg on the peer. A new GUID was created on restart. Problem solved for me.
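
A cautious variant of this fix keeps a backup instead of deleting anything. The sketch below rests on the assumption (consistent with the post above, but not verified here for every version) that Splunk writes a new instance.cfg with a fresh GUID at startup when the file is missing. The function name set_aside_instance_cfg is made up for illustration.

```shell
#!/bin/sh
# Move instance.cfg aside so that a fresh one can be generated on the next start.
# $1 is the Splunk installation directory (SPLUNK_HOME).
set_aside_instance_cfg() {
    cfg="$1/etc/instance.cfg"
    if [ ! -f "$cfg" ]; then
        echo "no instance.cfg found at $cfg" >&2
        return 1
    fi
    # Keep the old file as instance.cfg.bak so the change can be undone.
    mv "$cfg" "$cfg.bak"
}

# Intended sequence on the affected peer (not run automatically here):
#   "$SPLUNK_HOME/bin/splunk" stop
#   set_aside_instance_cfg "$SPLUNK_HOME"
#   "$SPLUNK_HOME/bin/splunk" start
```

Renaming rather than deleting means the old GUID is still on disk if the peer needs to go back to its previous identity.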

jmsiegma · ‎12-06-2012

Folks,

I solved my problem.. Here is how:

I had 4 servers in my splunk farm..
1 - Search Head
1 - Master Cluster
2 - Cluster Peers

I also could not get one of my peers to connect, and got the same message.
What it came down to was a firewall blocking communication on the Cluster Peers.

So, using nmap, I validated that the following ports were open:
8000 TCP
8089 TCP

The command I used was 'nmap -sS -p 8000-10000 {IP of cluster peer}'.

Once I figured that out, everything worked like a champ.

woodcock · ‎05-28-2018

You should click Accept on this answer to close your question. Also, it would help to know what the expected/correct output (and maybe the wrong output) of the command was.

dikaye · ‎11-29-2012

Warning from the peer node:

11-30-2012 12:05:54.871 +0800 WARN CMMasterHTTPProxy - failed method=POST path=/services/cluster/master/peers master=https://192.168.102.205:8089 rv=0 actual_response_code=500 expected_response_code=201 status_line=HTTP/1.1 500 Internal Server Error

dikaye · ‎11-29-2012

I got a log like this:

11-30-2012 09:58:48.054 +0800 INFO  CMMaster - Adding bid=_audit~1~D4DDF306-0648-4D7E-98B8-F837F439E6C2 (status='Complete' search_status='Searchable' mask=18446744073709551615 checksum= standalone=yes size=1091 genid=0) to peer=D4DDF306-0648-4D7E-98B8-F837F439E6C2

11-30-2012 09:58:48.054 +0800 ERROR CMMaster - event=addPeer guid=D4DDF306-0648-4D7E-98B8-F837F439E6C2 status=failed err="size=332 already committed"
11-30-2012 09:58:48.054 +0800 INFO CMPeer - removing bid=_audit~1~D4DDF306-0648-4D7E-98B8-F837F439E6C2 from peer=D4DDF306-0648-4D7E-98B8-F837F439E6C2
11-30-2012 09:58:48.054 +0800 INFO CMMaster - event=addBucketToFix bid=_audit~1~D4DDF306-0648-4D7E-98B8-F837F439E6C2 msg='Ignoring standalone bucket'
11-30-2012 09:58:48.054 +0800 ERROR ClusterMasterPeerHandler - Cannot add peer=192.168.102.204 mgmtport=8089 (reason: size=332 already committed)
11-30-2012 09:59:48.093 +0800 INFO ClusterMasterPeerHandler - Add peer info replication_address=192.168.102.204 forwarder_address= search_address= mgmtPort=8089 rawPort=8099 useSSL=false forwarderPort=0 forwarderPortUseSSL=true serverName=splunk-index-02.ntt.com.hk activeBundleId=e42fbfc3436bd89262c70e511d343b91 status=Up type=Initial-Add baseGen=0
11-30-2012 09:59:48.099 +0800 INFO CMMaster - event=removeOldPeer guid=D4DDF306-0648-4D7E-98B8-F837F439E6C2 hostport=192.168.102.204:8089 status=success
11-30-2012 09:59:48.099 +0800 INFO CMMaster - event=addPeer guid=D4DDF306-0648-4D7E-98B8-F837F439E6C2 replication_address=192.168.102.204 forwarder_address= search_address= mgmtPort=8089 rawPort=8099 useSSL=false forwarderPort=0 forwarderPortUseSSL=true serverName=splunk-index-02.ntt.com.hk activeBundleId=e42fbfc3436bd89262c70e511d343b91 status=Up type=Initial-Add baseGen=0 bucket_count=13
11-30-2012 09:59:48.099 +0800 INFO CMMaster - Adding bid=_audit~1~D4DDF306-0648-4D7E-98B8-F837F439E6C2 (status='Complete' search_status='Searchable' mask=18446744073709551615 checksum= standalone=yes size=1091 genid=0) to peer=D4DDF306-0648-4D7E-98B8-F837F439E6C2
11-30-2012 09:59:48.099 +0800 ERROR CMMaster - event=addPeer guid=D4DDF306-0648-4D7E-98B8-F837F439E6C2 status=failed err="size=332 already committed"
11-30-2012 09:59:48.099 +0800 INFO CMPeer - removing bid=_audit~1~D4DDF306-0648-4D7E-98B8-F837F439E6C2 from peer=D4DDF306-0648-4D7E-98B8-F837F439E6C2
11-30-2012 09:59:48.099 +0800 INFO CMMaster - event=addBucketToFix bid=_audit~1~D4DDF306-0648-4D7E-98B8-F837F439E6C2 msg='Ignoring standalone bucket'
11-30-2012 09:59:48.099 +0800 ERROR ClusterMasterPeerHandler - Cannot add peer=192.168.102.204 mgmtport=8089 (reason: size=332 already committed)

tmerenyi · ‎11-23-2012

Hi All,

I have a similar problem. All of the machines are VMware machines.
In my case rawPort=9887 and forwarderPort=0.
Thanks for your help,
Tamas

• failed to register with cluster master reason: failed method=POST path=/services/cluster/master/peers master=https://192.168.1.73:8089 rv=0 actual_response_code=500 expected_response_code=201 status_line=HTTP/1.1 500 Internal Server Error [ event=addPeer status=retrying replication_address= forwarder_address= search_address= mgmtPort=8089 rawPort=9887 useSSL=false forwarderPort=0 forwarderPortUseSSL=true serverName=splunk activeBundleId=1f449698180e6acdd12c2a003de7c242 status=Up type=Initial-Add baseGen=0 ]

jonuwz · ‎11-13-2012

Look in splunkd.log on the master node; it will give you more information.
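
As a rough sketch of what that lookup might look like, the helper below filters a log file for ERROR-level lines about peer registration. The pattern is based only on the CMMaster and ClusterMasterPeerHandler lines quoted elsewhere in this thread, so other versions may word these messages differently; addpeer_errors is a made-up name.

```shell
#!/bin/sh
# Print ERROR-level lines about failed peer registration from a splunkd.log file.
# $1 is the path to the log file on the master node.
addpeer_errors() {
    grep -E ' ERROR .*(event=addPeer|Cannot add peer)' "$1"
}
```

On the master this could be run as addpeer_errors "$SPLUNK_HOME/var/log/splunk/splunkd.log" | tail -n 20 to see the most recent failures and their reason fields.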
