Hello,
I have a problem that I can't solve.
I have a search head cluster (SHC) with 4 members (including the captain), running Splunk version 7.3.5.
We are in a multisite configuration. We wanted to run a test: put one search head in stand-alone mode and simulate a power cut on the 3 others. Everything worked, then we returned to normal. ALL CLEAR.
But recently we realized that we had a problem (bug?).
Our 4 SHC members are in the same cluster, verified directly on the servers via the CLI.
But in the GUI we see two different SH clusters: the first with 3 members, the second with only 1.
show shcluster-status shows the cluster, its 4 members, and its ID (starting with EDF6).
The [shclustering] stanza in server.conf on all 4 search heads has the same ID, EDF6[...].
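For reference, the stanza we are comparing looks roughly like this on each member (the id and label below are placeholders, not our real values):

[shclustering]
id = EDF6XXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
mgmt_uri = https://<this_member>:8089
shcluster_label = shcluster1

You can also dump the effective settings with: /opt/splunk/bin/splunk btool server list shclustering --debug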
I remind you that despite this, everything works normally.
We've tried a lot of solutions with no results. Is this a bug or do you have any ideas?
Attached are some screenshots, to make things easier.
Thank you very much
Hi there!
We finally found the solution!
On the SLM, go to Settings > General Setup > Edit the node > Disable monitoring, then re-enable it.
Thanks to all
I've just tried a few commands:
- splunk show shcluster-status --verbose doesn't return any useful additional info
- No errors in splunkd.log
- splunk resync shcluster-replicated-config changes nothing
- splunk restart changes nothing
- Switching from dynamic to static captain and back to dynamic + bootstrap changes nothing, apart from changing the captain (commands sketched below).
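For completeness, the captain-mode switch was done with commands along these lines (the URIs are placeholders, not our real hosts):

splunk edit shcluster-config -mode captain -captain_uri https://sh1.example:8089 -election false   (on the chosen static captain)
splunk edit shcluster-config -mode member -captain_uri https://sh1.example:8089 -election false    (on each other member)
splunk edit shcluster-config -election true -mgmt_uri https://<member>:8089                        (back to dynamic, on every member)
splunk bootstrap shcluster-captain -servers_list "https://sh1.example:8089,https://sh2.example:8089,https://sh3.example:8089,https://sh4.example:8089"   (on the member to bootstrap as captain)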
Each search head reports the same cluster ID, whether I look at the captain, the shcluster-status output, or the [shclustering] stanza.
It's really only in the GUI that I see another bundle. I don't get it...
I didn't do the original integration, so I don't know whether any custom scripts interfere with the built-in Splunk startup, but I doubt it, given the rest of the project.
Hi
First of all, you are running quite an old version that is no longer supported! You should plan to upgrade it as soon as possible. And when you upgrade, you must go through quite a few intermediate versions, or else just build a new environment from scratch; you cannot upgrade directly to 9.3.x!
What exactly did you do, and how, for this test: "We wanted to do a test to put a Search Head in stand-alone mode and simulate a power cut with the 3 others. Everything worked, then we returned to normal."? Depending on how you did it, there is a real possibility that you somehow corrupted your SHC.
And as @livehybrid said, you should have an odd number of nodes; otherwise the RAFT protocol can cause some weird situations, and there is a real possibility that electing a new captain won't work when the other site is down. But as he suspects, if you just "remove one node from the SHC", that shouldn't hit you (quorum arithmetic below).
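To make the quorum arithmetic concrete (this is the standard RAFT majority rule, not anything version-specific):

majority = floor(members / 2) + 1
4 members -> majority = 3: if a 2-member site goes down, the 2 survivors cannot elect a captain
5 members -> majority = 3: losing a 2-member site still leaves 3 votes, so an election can succeed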
By the way, how are your nodes divided between the sites, both the SHC members and the search peers? You probably have only one CM, in one site? And did you test just the SHC side, not the whole environment?
One comment and guideline: never take "fix" steps in production, like resyncing configurations, before you know the cause of your situation. There is a real possibility that such steps lead to a much more severe, even unfixable, situation, and then the only way out is to restore your environment (e.g. the whole SHC) from backups.
Can you show the output of the following (at least from one of the working nodes, and also from the odd node out):
- splunk show shcluster-status -verbose
- splunk show kvstore-status -verbose
I can't recall whether -verbose already works on 7.3 or whether it was implemented in a later version.
And how did you join this "cut" node back to the SHC? Did you clean it first, or just add it back?
There is a real possibility that your KV store still holds information about this temporarily cut member, and that is what the GUI is showing you. On 7.3 and earlier, SHC was not as robust as it is today; it was quite easy to get it messed up by accident, and from time to time it managed to do so by itself 😞
r. Ismo
- On each SHC member, run: /opt/splunk/bin/splunk show shcluster-status -verbose
- The -verbose flag provides additional details, such as replication status, member health, and any pending actions. Look for discrepancies (e.g., a member marked as "Pending" or "Out of Sync").
- Check splunkd.log on all four SHC members for clustering-related errors. Focus on the SHCMaster, SHCMember, and ConfReplication components, and on errors like "failed to proxy call" or "replication failure" (see the grep sketch after this list).
- The GUI might be stuck due to a caching issue. Restart only Splunk Web (not the full splunkd process) on all members: /opt/splunk/bin/splunk restart splunkweb
- If the GUI still shows incorrect data, resynchronize the replicated configuration: on the out-of-sync member, run /opt/splunk/bin/splunk resync shcluster-replicated-config (see https://docs.splunk.com/Documentation/Splunk/9.2.1/DistSearch/HowconfrepoworksinSHC). This pulls the latest replicated configuration from the captain, which might realign the GUI's view.
- Rolling restart (if needed): if the above doesn't resolve the issue, perform a controlled rolling restart of the cluster: /opt/splunk/bin/splunk rolling-restart shcluster-members
- This ensures all members restart cleanly and re-register with the captain, potentially fixing any GUI misalignment. Monitor the GUI and CLI status post-restart.
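A quick way to scan splunkd.log for those components (assuming the default install path; adjust if yours differs):

grep -iE 'SHCMaster|SHCMember|ConfReplication' /opt/splunk/var/log/splunk/splunkd.log | grep -iE 'ERROR|WARN' | tail -50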
Splunk 7.3.5 is several years old, and while it’s stable for many environments, there have been reported bugs in SHC management and GUI rendering in earlier 7.x versions. Check Splunk’s Known Issues documentation
https://docs.splunk.com/Documentation/Splunk/7.3.5/ReleaseNotes/KnownIssues
Hi @Treize
It is generally advised that an SHC should comprise an odd number of nodes; this is to prevent a split-brain situation. However, because you have 3 which have maintained their cluster, I am not sure this is a split-brain situation.
You may need to stop the single SH, clean it, and re-join it to the cluster (rough command outline below). Check out https://community.splunk.com/t5/Deployment-Architecture/SHC-New-Member-reverts-to-down-after-restart... which has some similar conversation covering this.
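A rough outline of the clean-and-rejoin, assuming a default install path; the URIs are placeholders, and you should confirm the exact procedure in the docs for your version:

On the captain, remove the stale entry:
  /opt/splunk/bin/splunk remove shcluster-member -mgmt_uri https://<affected_sh>:8089
On the affected member:
  /opt/splunk/bin/splunk stop
  /opt/splunk/bin/splunk clean raft
  /opt/splunk/bin/splunk start
  /opt/splunk/bin/splunk add shcluster-member -current_member_uri https://<any_existing_member>:8089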
When you run show shcluster-status on the single SH, does it show the same cluster details as the other 3?
Do you have any custom startup scripts running on the host that might interfere with the built-in Splunk startup? e.g. running any commands to join clusters, set up captains, etc. (some quick checks below).
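If you want to check quickly, something like this on each host (generic Linux commands; unit and script names may differ in your environment):

systemctl cat Splunkd 2>/dev/null          (shows the boot-start unit, if Splunk was enabled via systemd)
ls -l /etc/init.d/splunk* 2>/dev/null      (or the legacy init script)
grep -r shcluster /etc/cron* /etc/rc.local 2>/dev/null   (any cron/rc hooks touching the cluster)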
Please let me know how you get on and consider adding karma to this or any other answer if it has helped.
Regards
Will
