Remove peer from Index Cluster and re-add after maintenance

jiaqya
Builder

I need to take one peer down for maintenance, so I run "splunk stop" on it.

The cluster handles this and returns to a valid state, with one indexer showing as "Down" or "Stopped".

The dashboards now show all the data, but there is a small yellow triangle indicating that one peer is not searchable.

So I removed this peer from the cluster master, and now the warning is gone.

I will be re-adding it after the maintenance activity.

So I would like to understand what happens when I re-add it...

Will the cluster master recognize that this peer was removed and then re-added?

Or will it treat it as a peer that was down and has now come back up?

 


rnowitzki
Builder

Hi @jiaqya ,

You should use "splunk offline" on the peer that is going into maintenance.

https://docs.splunk.com/Documentation/Splunk/8.0.5/Indexer/Takeapeeroffline

 "Caution: Do not use splunk stop to take a peer offline. Instead, use splunk offline. It stops the peer in a way that minimizes disruption to your searches"

BR
Ralph

--
Karma and/or Solution tagging appreciated.

jiaqya
Builder

Would you still recommend the "offline" command if we have SF=1?

The problem I have now is that the dashboards show yellow "peer down" messages while the indexer is down...

So I removed it from the cluster to keep the dashboards clean, but I will re-add the peer and would like to understand the behavior when re-adding it.


rnowitzki
Builder

Hi @jiaqya ,

Yes, use "offline" in that case as well. In terms of which data is available/searchable, it makes no difference whether you use "stop" or "offline". "offline" is just a smoother procedure for closing bucket/indexing activity.

Assuming you have RF>1, what happened when you took the peer down is that the buckets on that node that were "primary" (= searchable) became "primary" on another node. That probably took some time, as there was no metadata (index files) for these buckets on the other node(s).

I would recommend setting SF to at least 2.
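To make the SF=1 vs. SF=2 difference concrete, here is a rough toy model in Python. It is only a sketch of the idea described above, not how Splunk implements it: the `Bucket` class, the `lose_peer` function, and the peer names `idx1`/`idx2` are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Bucket:
    name: str
    copies: Dict[str, bool]  # hypothetical peer name -> True if that copy is searchable
    primary: str             # peer whose copy currently answers searches


def lose_peer(bucket: Bucket, down_peer: str) -> Tuple[Optional[str], bool]:
    """Return (new_primary, needs_index_rebuild) once down_peer is gone."""
    if bucket.primary != down_peer:
        return bucket.primary, False       # primary copy is unaffected
    remaining = {p: s for p, s in bucket.copies.items() if p != down_peer}
    if not remaining:
        return None, False                 # no other copy exists at all
    searchable = sorted(p for p, s in remaining.items() if s)
    if searchable:
        return searchable[0], False        # promote an existing searchable copy
    # Only raw-data copies are left: one must be made searchable first.
    return sorted(remaining)[0], True


sf1 = Bucket("b1", {"idx1": True, "idx2": False}, primary="idx1")  # RF=2, SF=1
sf2 = Bucket("b1", {"idx1": True, "idx2": True}, primary="idx1")   # RF=2, SF=2
print(lose_peer(sf1, "idx1"))  # ('idx2', True)
print(lose_peer(sf2, "idx1"))  # ('idx2', False)
```

With SF=1 the surviving copy has to be made searchable before it can serve as primary, which is the slow part. With SF=2 a searchable copy is already waiting on another peer.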

Also see
https://docs.splunk.com/Documentation/Splunk/latest/Indexer/Bucketsandclusters

When you add the peer back to the cluster, the buckets will be spread evenly again across the available nodes until RF and SF are met again. This should not cause any harm to the data.
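As a rough illustration of what the cluster sees when the peer comes back, the sketch below counts bucket copies above RF after a rejoin. It assumes the cluster already created replacement copies while the peer was away. The function `excess_after_rejoin`, the bucket names, and the peer names are hypothetical, and the real fix-up logic is more involved.

```python
from typing import Dict, Set


def excess_after_rejoin(
    copies_by_bucket: Dict[str, Set[str]],
    returning_peer: str,
    returning_peer_buckets: Set[str],
    rf: int,
) -> Dict[str, int]:
    """Map each bucket to its number of copies above rf once the peer is back."""
    excess = {}
    for bucket, peers in copies_by_bucket.items():
        holders = set(peers)
        if bucket in returning_peer_buckets:
            holders.add(returning_peer)    # the old copy is still on its disk
        if len(holders) > rf:
            excess[bucket] = len(holders) - rf
    return excess


# State after idx1 was away long enough for the cluster to restore RF=2.
after_fixup = {
    "b1": {"idx2", "idx3"},  # was on idx1 + idx2, replacement made on idx3
    "b2": {"idx2", "idx3"},  # was on idx1 + idx3, replacement made on idx2
    "b3": {"idx2", "idx3"},  # never lived on idx1
}
print(excess_after_rejoin(after_fixup, "idx1", {"b1", "b2"}, rf=2))
# {'b1': 1, 'b2': 1}
```

In this model no data is lost on rejoin. The cluster simply ends up holding more copies than RF requires for the buckets the returning peer still has.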

--
Karma and/or Solution tagging appreciated.


jiaqya
Builder

Thanks for the explanation.

We have SF=1 and it will probably stay that way, so I think "stop" is the right way here.

So we just bring the peer online once the activity is done, and it should reconnect to the cluster, with some bucket-fixing activity on the cluster side.

Since there are no data issues with this, that answers my question.

thanks

 


rnowitzki
Builder

As I wrote earlier, "stop" is not recommended, regardless of SF. Use "offline" instead.

BR

Ralph

--
Karma and/or Solution tagging appreciated.

jiaqya
Builder

Hi Ralph, I was looking at this document: https://docs.splunk.com/Documentation/Splunk/8.0.5/Indexer/Takeapeeroffline

It contains the note below, which is why I was staying away from "offline", since we have SF=1.

Note: If the cluster has a search factor of 1, the cluster does not attempt to reallocate primary copies before allowing the peer to go down. With a search factor of 1, the cluster cannot fix the primaries without first creating new searchable copies, which takes significant time and thus would defeat the goal of a fast shutdown.

 

 


rnowitzki
Builder

OK, that note is true; it takes a while.

It depends on whether you want your environment and the data in it to be available during the maintenance.
If you have a change window and the environment is not in use anyway, you'd be faster just stopping and removing the peer, but I would not recommend it.

Best Regards
Ralph

--
Karma and/or Solution tagging appreciated.

jiaqya
Builder

Yes, we have a change window of at least 20 hours for each peer.

We have SF=1 and RF=2, so when we use "stop", all data remains available for search while replication happens in the background, so users do not see any data issues in their reports.

The issue was that they see a yellow triangle indicating that a peer is down in the cluster.

So I had to remove the peer from the cluster, only to add it back after 20 hours.

So I wanted to know whether that could cause any data issues.

What I know is that the cluster will perform remediation tasks if needed, and there will be excess buckets to clear.
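For reference, the sequence discussed in this thread can be written down as an ordered checklist. The Python sketch below only builds the list of commands and does not run anything. The helper `maintenance_plan` and its `index_name` parameter are hypothetical. The commands themselves (`splunk offline`, `splunk start`, `splunk show cluster-status`, `splunk remove excess-buckets`) are standard Splunk CLI commands as far as I know, but please check them against the docs for your version.

```python
from typing import List, Optional, Tuple


def maintenance_plan(index_name: Optional[str] = None) -> List[Tuple[str, str]]:
    """Return ordered (where_to_run, command) steps for one peer's maintenance."""
    remove_excess = "splunk remove excess-buckets"
    if index_name:
        remove_excess += f" {index_name}"  # optionally limit cleanup to one index
    return [
        ("peer", "splunk offline"),                 # graceful shutdown of the peer
        # ... the maintenance work happens here ...
        ("peer", "splunk start"),                   # peer reconnects to the cluster
        ("manager", "splunk show cluster-status"),  # confirm RF and SF are met again
        ("manager", remove_excess),                 # clear leftover extra copies
    ]


for where, command in maintenance_plan():
    print(f"[{where}] {command}")
```

The plan follows the advice given earlier in the thread by using "offline" rather than "stop", and it leaves the removal of excess copies as an explicit last step on the manager.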


jiaqya
Builder

I also believe the peer would add itself back once it comes up, as no other changes are being made on the downed peer.

 
