Security

Search head's authentication credentials rejected by peer. I've re-added peers 4 times and this keeps happening!

gozulin
Communicator

We've migrated some search heads: I deleted the indexer peers, re-added the cluster masters, and everything looks fine and healthy... until suddenly it doesn't.

I've read the installation guide and I don't see anything wrong with what I've done.

I've gone into server.conf on the search heads and uncommented the distributed search servers. I've gone through the GUI and added the cluster masters for the 2 clusters to be searched.

In the GUI, under Distributed Search, I see this (for all 4 indexers):

Error [00000010] Instance name "splunk01.domain.com" Search head's authentication credentials rejected by peer. Try re-adding the peer. Last Connect Time:2017-03-24T03:43:41.000+00:00; Failed 10 out of 10 times.

and under indexer clustering, everything looks hunky dory:

Cluster Master           Searchable               Search Factor   Replication Factor   Status   Actions
https://10.1.2.20:8089   All Data is Searchable   Met             Met

And here's the kicker: restarting Splunk FIXES it temporarily! But less than an hour later, the problem resurfaces...

I don't know what causes this and I'm starting to think it's a bug, because I've triple-checked my configs and everything looks fine...

Any clues?

1 Solution

gozulin
Communicator

Thanks to Woodcock, I just realized something. We added 2 search heads to the cluster within a 24-hour span: splsearch01.domain1.com and splsearch01.domain2.com.

As you can see, BOTH search heads' hostname is splsearch01, so these problems arose because they kept overwriting each other's .pem files, and only one search head could be authenticated at any given time. None of this was logged as an error or warning by Splunk... so it was effectively invisible.

In other words, this was all caused by the COINCIDENCE that the two search heads we just installed shared the same name, because Splunk didn't resolve an FQDN (fully qualified domain name) by default when it was installed. It just used whatever the output of the command "hostname" was and left it at that. That's right: it didn't use a unique identifier. It assumed the output of the command hostname would be unique across all Splunk instances... why, Splunk, why?

I went and manually added the FQDN to server.conf on both search heads so Splunk identifies them by their UNIQUE FQDN, and I've verified that the .pem files are now stored in DIFFERENT directories, so no .pem files will be overwritten.

Going forward, we'll have to take this into account and rename each host with the FQDN (hostname bla.xxx.cequintecid.com) before installing Splunk (because Splunk won't do this for us... it could have easily done so using facter...). This should prevent similar situations from ever arising again.
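
For reference, here is a minimal sketch of that server.conf change, using the FQDNs above (the value is simply whatever unique name you pick):

[general]
serverName = splsearch01.domain1.com

(and serverName = splsearch01.domain2.com on the other search head). After a restart, each search head's trusted.pem should land in its own $SPLUNK_HOME/etc/auth/distServerKeys/<serverName>/ directory on the peers instead of colliding.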

bandit
Motivator

Ugh! I figured out how I broke this. I have a large Linux host and I stacked 3 search heads on it in different directories with unique ports: one for ITSI, one for plain-vanilla Splunk, and one as a prod copy/staging. Basically 3 test environments. The trouble was that I set up the second two quickly to help a colleague troubleshoot another issue, and I didn't change the Splunk server name, which defaults to the short host name of the server.

Long story short, all 3 instances were identifying themselves as the same Splunk server to the same search peers, resulting in authentication working for a little while and then breaking randomly. I imagine this could also happen to someone by cloning a search head to another host, not making the Splunk server name unique, and then connecting up to the same search peers. I hope this saves someone else some frustration. In my case, there were no search clusters or index replication clusters. Perhaps a future update to Splunk could warn that the name is already in use by another search head?

In server.conf:

[general]
serverName = unique_name_here

While you're at it, you might as well set a unique host name in inputs.conf as well:

[default]
host = unique_name_here
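
If it's easier, the same values can also be checked and set from the CLI (a sketch, assuming a default $SPLUNK_HOME):

$SPLUNK_HOME/bin/splunk show servername
$SPLUNK_HOME/bin/splunk set servername unique_name_here
$SPLUNK_HOME/bin/splunk show default-hostname
$SPLUNK_HOME/bin/splunk set default-hostname unique_name_here
$SPLUNK_HOME/bin/splunk restart

The default-hostname pair corresponds to the host setting in inputs.conf.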

woodcock
Esteemed Legend

So it really was the exact same problem that the OP had. When you run multiple instances on a single server, it is IMPERATIVE that you set unique serverName values inside of server.conf for each instance to avoid this (and other) problems. If you had mentioned a multiple-instance server, I would have immediately known that this was the problem.

athorat
Communicator

@woodcock Thanks a lot, this helped to troubleshoot one of our internal issues.

woodcock
Esteemed Legend

Don't forget to UpVote then!

woodcock
Esteemed Legend

You should file a bug on this.

aksharp
Explorer

I have the same issue with 1 search head and 1 indexer. I can add the peer, but the console still shows the same message.
The trusted.pem is getting passed to the indexer correctly, and curl -k -u admin:password https://indexer:8089/services/server/info gives me the server info.
Can anyone help?

Details
Search head splunkd.log:

07-09-2018 17:21:46.755 +0000 WARN GetRemoteAuthToken - Unable to get authentication token from peeruri="https://indexer1.dev:8089/services/admin/auth-tokens".
07-09-2018 17:21:46.755 +0000 WARN DistributedPeer - Peer:https://indexer41.dev:8089 Authentication Failed
Indexer splunkd.log:

07-09-2018 17:21:46.747 +0000 WARN AdminHandler:AuthenticationHandler - Denied session token for user: splunk-system-user
07-09-2018 17:21:46.755 +0000 WARN AdminHandler:AuthenticationHandler - Denied session token for user: splunk-system-user

bandit
Motivator

I'm having the exact same issue and my 10 search peers all have both unique short and FQDN names.

woodcock
Esteemed Legend

I've got nothing. I would open a support case.

woodcock
Esteemed Legend

It is not entirely necessary to do this through the GUI; you can manually configure a search peer as follows:

On your Search Head, get a copy of this file:

$SPLUNK_HOME/etc/auth/distServerKeys/trusted.pem

Also modify this file and add in the new Indexer (it might be in a different location so poke around):

$SPLUNK_HOME/etc/system/local/distsearch.conf

Also get the hostname of the Search Head with this command:

hostname

On your Indexer(s), go to this directory:

$SPLUNK_HOME/etc/auth/distServerKeys/

Create a directory there named after your Search Head's hostname and put the trusted.pem file from the Search Head there.
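
For anyone following along, a rough sketch of that distsearch.conf edit on the Search Head (the indexer URIs are placeholders):

[distributedSearch]
servers = https://indexer1.example.com:8089,https://indexer2.example.com:8089

A restart of splunkd is typically needed on the indexer(s) after the trusted.pem is dropped into place.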

koshyk
Super Champion

Great to see a command-line option. I wish Splunk also pushed this into "apps", so we could use the central app distribution framework for hands-free addition rather than touching the etc/auth location 😞

woodcock
Esteemed Legend

I agree, but...

gozulin
Communicator

Wow... your answer made me realize something. We added 2 search heads almost simultaneously, which were named sh1.aqx.com and sh1.oly.com. Splunk, in its wisdom, only used the first part to identify them in server.conf... both were named sh1.

So my guess is that each time I added the cluster to one of them, its .pem overwrote the existing .pem under /distServerKeys/sh1/, which obviously caused re-authentication to fail for the one whose .pem was overwritten.

This could all have been avoided if Splunk used a unique identifier (such as an IP or an FQDN) when deciding what to call itself, instead of just assuming the output of the command "hostname" would be unique across the entire environment.
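
To illustrate the collision on each indexer (paths assumed from the defaults discussed in this thread):

# Both search heads wrote to the same key directory, so the last one to connect won:
$SPLUNK_HOME/etc/auth/distServerKeys/sh1/trusted.pem

# With unique serverName values, each search head gets its own directory:
$SPLUNK_HOME/etc/auth/distServerKeys/sh1.aqx.com/trusted.pem
$SPLUNK_HOME/etc/auth/distServerKeys/sh1.oly.com/trusted.pem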

woodcock
Esteemed Legend

There you go. Teamwork!

maraman_splunk
Splunk Employee

It looks like the peers you are talking about are indexers that are part of a cluster.
I understand you are modifying the peer list on the SH.
You should never add or delete them on the search head, as the CM will give the list to the search head in a way that ensures replicated buckets are only searched once.
To fix it, I would:
make a backup,
remove the SH from the CM (on the SH, by commenting out the reference to the CM in server.conf, then restarting Splunk on the SH),
remove all search peers (indexers) from the SH (via the GUI),
uncomment the CM line in server.conf and restart (see the sketch after this list).

Test that you can:
see the SH in the CM's list,
see peers under the SH (but don't modify them),
confirm that a search doesn't return data twice (that would mean peers are defined statically).

If you are still losing connectivity after a while:
check that everything is NTP-synchronized,
check the load on the indexers; if you are in a virtualized test environment where something else is loading the hosts, increase the HTTP timeout on the SH and the indexers (if an indexer is too slow, the SH will think the auth failed),
if you are sending more simultaneous searches to the indexers than their capacity allows, reduce the number of parallel searches on the SH.
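
As a rough sketch, the CM reference that gets commented out and back in lives in server.conf on the SH (the URI and key below are placeholders; with two clusters there would be one [clustermaster:<label>] stanza per CM):

[clustering]
mode = searchhead
master_uri = https://10.1.2.20:8089
pass4SymmKey = <cluster secret>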

gozulin
Communicator

I've already performed the steps you describe to clean the config and re-add the cluster masters. When I say I re-added the peers, I mean I did it by adding the CM, not manually. That didn't help. NTP is enabled and this is not a load issue; the indexers have light workloads. Any other ideas?

somesoni2
Revered Legend

Is the secret key the same between the clusters and the search head?

gozulin
Communicator

If I use the wrong password when adding a search head to a cluster, it doesn't let me add it to begin with. Full stop. I also mentioned it works fine for a while after I restart Splunk. That wouldn't happen if the secret key were different, would it?

In which files, and which stanzas, am I supposed to check whether the secret keys are the same?
