We've migrated some search heads: I've deleted the indexer peers, re-added the cluster masters, and everything looks fine and healthy ...until suddenly it doesn't.
I've read the installation guide and I don't see anything wrong with what I've done.
I've gone into server.conf on the search heads and uncommented the distributed search servers. I've gone through the GUI and added the cluster masters for the 2 clusters to be searched.
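For context, the relevant pieces on the search heads look roughly like this (hostnames and keys are placeholders except for the CM URI shown further down; the static peer list normally lives in distsearch.conf rather than server.conf, so adjust to wherever your deployment keeps it):
In $SPLUNK_HOME/etc/system/local/distsearch.conf:
[distributedSearch]
servers = https://indexer1.example.com:8089,https://indexer2.example.com:8089
In $SPLUNK_HOME/etc/system/local/server.conf (one stanza per cluster master):
[clustering]
mode = searchhead
master_uri = clustermaster:cluster1, clustermaster:cluster2
[clustermaster:cluster1]
master_uri = https://10.1.2.20:8089
pass4SymmKey = <cluster1 secret>
[clustermaster:cluster2]
master_uri = https://<second-cm>:8089
pass4SymmKey = <cluster2 secret>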
In the GUI, under distributed search, I see this (for all 4 indexers):
Error [00000010] Instance name "splunk01.domain.com" Search head's authentication credentials rejected by peer. Try re-adding the peer. Last Connect Time:2017-03-24T03:43:41.000+00:00; Failed 10 out of 10 times.
and under indexer clustering, everything looks hunky dory:
Cluster Master | Searchable | Search Factor | Replication Factor | Status | Actions
https://10.1.2.20:8089 | All Data is Searchable | Met | Met | |
And here's the kicker: restarting Splunk FIXES it temporarily! But less than an hour later, the problem resurfaces...
I don't know what causes this and I'm starting to think it's a bug, because I've triple-checked my configs and everything looks fine...
Any clues?
Thanks to Woodcock, I just realized something. We added 2 search heads to the cluster in a 24h span: splsearch01.domain1.com and splsearch01.domain2.com.
As you can see, BOTH search heads' hostname is splsearch01, so these problems arose because they kept overwriting each other's pem files, meaning only one search head could be authenticated at any given time. None of this was logged as an error or warning by Splunk...so it was effectively invisible.
In other words, this was all caused by the COINCIDENCE that both search heads we just installed shared the same name, because Splunk didn't go and get an FQDN (fully qualified domain name) by default when it was installed. It just used whatever the output of the command "hostname" was and left it at that. That's right: it didn't use a unique identifier. It assumed the output of the command hostname would be unique across all Splunk instances...why, Splunk, why?
I went and manually added the FQDN to server.conf on both search heads so Splunk is identifying them by their UNIQUE FQDN, and I've verified that the pem files are now stored in DIFFERENT directories, so no pem files will be overwritten.
Going forward, we'll have to take this into account and rename each host with the FQDN (hostname bla.xxx.cequintecid.com) before installing Splunk (because Splunk won't do this for us...it could have easily done so using facter...). This should prevent similar situations from ever arising again.
Ugh! I figured out how I broke this. I have a large Linux host and I stacked 3 search heads on it in different directories with unique ports: one for ITSI, one for plain vanilla Splunk, and one as a prod copy/staging. Basically 3 test environments. The trouble was that I set up the second two quickly to help a colleague troubleshoot another issue, and I didn't change the Splunk server name, which defaults to the short host name of the server. Long story short, all 3 instances were identifying themselves as the same Splunk server to the same search peers, resulting in authentication working for a little while and then breaking randomly. I imagine this could also happen to someone who clones a search head to another host, doesn't make the Splunk server name unique, and then connects it up to the same search peers. I hope this saves someone else some frustration. In my case, there were no search head clusters or index replication clusters. Maybe a future update to Splunk could warn that the name is already in use by another search head?
In server.conf:
[general]
serverName = unique_name_here
While you're at it, you might as well set a unique name in inputs.conf as well:
[default]
host = unique_name_here
So it really was the exact same problem that the OP had. When you run multiple instances on a single server, it is IMPERATIVE that you set unique serverName values inside of server.conf for each instance to avoid this (and other) problems. If you had mentioned a multiple-instance server, I would have immediately known that this was the problem.
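If you prefer doing it from the CLI on each stacked instance, something like this should work (the instance path and names are examples; check the CLI help on your version):
# point at each instance's own SPLUNK_HOME and give it a unique serverName and host
/opt/splunk_itsi/bin/splunk set servername sh-itsi-01 -auth admin:changeme
/opt/splunk_itsi/bin/splunk set default-hostname sh-itsi-01 -auth admin:changeme
/opt/splunk_itsi/bin/splunk restart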
@woodcock Thanks a lot, this helped to troubleshoot one of our internal issues.
Don't forget to UpVote, then!
You should file a bug on this.
I have the same issue with 1 search head and 1 indexer. I can add the peer, but the console still shows the same message.
The trusted.pem is getting passed on to the indexer correctly, and curl -k -u admin:password https://indexer:8089/services/server/info gives me the server info.
Can anyone help?
Details
Search head splunkd.log:
07-09-2018 17:21:46.755 +0000 WARN GetRemoteAuthToken - Unable to get authentication token from peer uri="https://indexer1.dev:8089/services/admin/auth-tokens".
07-09-2018 17:21:46.755 +0000 WARN DistributedPeer - Peer:https://indexer41.dev:8089 Authentication Failed
Indexer splunkd.log
07-09-2018 17:21:46.747 +0000 WARN AdminHandler:AuthenticationHandler - Denied session token for user: splunk-system-user
07-09-2018 17:21:46.755 +0000 WARN AdminHandler:AuthenticationHandler - Denied session token for user: splunk-system-user
I'm having the exact same issue and my 10 search peers all have both unique short and FQDN names.
I've got nothing. I would open a support case.
It is not entirely necessary to do this through the GUI; you can manually configure a search peer as follows:
On your Search Head, get a copy of this file:
$SPLUNK_HOME/etc/auth/distServerKeys/trusted.pem
Also modify this file and add in the new Indexer (it might be in a different location so poke around):
$SPLUNK_HOME/etc/system/local/distsearch.conf
Also get the hostname of the Search Head with this command:
hostname
On your Indexer(s), go to this directory:
$SPLUNK_HOME/etc/auth/distServerKeys/
Create a directory there named after your Search Head's hostname and put the trusted.pem file from the Search Head in it.
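On the indexer side, the steps above boil down to something like this (the search head name and SPLUNK_HOME paths are examples; use whatever your hostname command actually returned):
# directory name must match the Search Head's name (output of hostname / its serverName)
mkdir -p $SPLUNK_HOME/etc/auth/distServerKeys/splsearch01
# copy the Search Head's trusted.pem into that directory (e.g. via scp), then restart
scp splsearch01:/opt/splunk/etc/auth/distServerKeys/trusted.pem $SPLUNK_HOME/etc/auth/distServerKeys/splsearch01/
$SPLUNK_HOME/bin/splunk restart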
Great to see a command-line option. I wish Splunk pushed this into "apps" as well, so we could use the central app distribution framework for hands-free addition rather than touching the etc/auth location 😞
I agree, but...
Wow...your answer made me realize something. We added 2 search heads almost simultaneously, which were named sh1.aqx.com and sh1.oly.com. Splunk, in its wisdom, only used the first part to identify them in server.conf...both were named sh1.
So my guess is that each time I added the cluster to one of them, its pem overwrote the existing pem under /distServerKeys/sh1/, which obviously caused re-authentication to fail for the one whose pem was overwritten.
This could all have been avoided if Splunk used a unique identifier (such as an IP or an FQDN) when deciding what to call itself, instead of just assuming the output of the command "hostname" would be unique across the entire environment.
There you go. Teamwork!
It looks like the peers you are talking about are indexers that are part of a cluster.
I understand you are modifying the peer list on the SH.
You should never add or delete them on the search head, as the CM gives the list to the search head in a way that ensures replicated buckets are only used once.
To fix it, I would:
make a backup,
remove the SH from the CM (from the SH, by commenting out the reference to the CM in server.conf, then restart Splunk on the SH)
remove all search peers (indexers) from the SH (via the GUI, or the CLI sketch below)
uncomment the CM line in server.conf + restart
test that you can:
see the SH in the CM list
see peers under the SH (but don't modify them)
test that from a search you don't see data twice (that would mean peers are defined as static)
If you are still losing connectivity after a while:
check that everything is NTP-synchronized
check the load on the indexers; if you are in a virtualized test env where something else loads the hosts, increase the HTTP timeout on the SH and idx (if the indexer is too slow, the SH will think the auth failed)
if you are sending too many searches simultaneously to the indexers (i.e. more than their capacity), reduce the number of parallel searches on the SH
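For the peer cleanup and the CM toggle, the CLI equivalents look roughly like this (credentials and hostnames are placeholders; verify the flags against the distributed search docs for your version):
# on the SH: see which search peers are currently configured
$SPLUNK_HOME/bin/splunk list search-server -auth admin:changeme
# remove a statically added peer (repeat per indexer)
$SPLUNK_HOME/bin/splunk remove search-server -url indexer1.example.com:8089 -auth admin:changeme
# comment out the CM reference ([clustering] master_uri / [clustermaster:...]) in
# $SPLUNK_HOME/etc/system/local/server.conf, restart, then uncomment and restart again
$SPLUNK_HOME/bin/splunk restart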
I've already performed the steps you describe to clean the config and re-add the cluster masters. When I say I re-added the peers, I mean I did it by adding the CM, not manually. That didn't help. NTP is enabled, and this is not a load issue; the indexers have light workloads. Any other ideas?
Is the secret key the same between the clusters and the search head?
If I use the wrong password when adding a search head to a cluster, it doesn't let me add it to begin with. Full stop. I also mentioned it works fine for a while after I restart Splunk; that would not happen if the secret key were different, would it?
In which files and which stanzas am I supposed to check whether the secret keys are the same?