Deployment Architecture

Is it safe to delete the files and directories under splunk/var/run/ for search heads?

Engager

I am having issues with search head members not pushing changes to the captain. I read in one of the posts to delete the files and directories under splunk/var/run/ and then do a restart.

Builder

Hi @lmvmandadi ,

The answer to your question of whether it's safe to delete is, "it depends". In an SHC environment you might be able to remove items from that folder without major impact. However, given the questions and responses in this thread, I would probably look at rebuilding that particular search head:
1. Stop the problem search head
2. Back up the /opt/splunk folder (or wherever $SPLUNK_HOME points)
3. Move or delete the /opt/splunk folder
4. Install a clean copy of Splunk at the same version as your other members
5. Add the clean install to the SHC and let the SHC captain re-sync all the files

The downside to the above is that you will lose any changes made on that particular server (not lost permanently, because we backed up all the config files, but the changes will need to be re-done on the clean copy).

I think this is likely the better solution, given that you'd otherwise probably spend more time troubleshooting SHC issues and potentially introduce other problems into the SHC while trying to resolve them.

Hope this helps.

Ultra Champion

Most of the data should be under splunk/var/run/splunk/dispatch.

-- In the dispatch directory, a search-specific directory is created for each search or alert. Each search-specific directory contains several files, including a CSV file of the search results, a search.log file with details about the search execution, and more.

You can read about it at Dispatch directory and search artifacts.

The bottom of the page speaks about:

-- Clean up the dispatch directory based on the age of directories

Influencer

What issues are you facing? Have you checked the internal logs for SHC push errors first? Deleting the search artifacts residing in /var/run/ won't necessarily help.

Engager

For two days I have been having search head clustering issues: "Search head cluster member (https://hesplsrhc003:8089) is having problems pushing configurations to the search head cluster captain. Changes on this member are not replicating to other members."

I tried to change the captain, did a rolling restart, and ran the resync command, but I still have issues.

Influencer

You can enable more aggressive logging for the SHC components and see what it says. Is your network connection between members OK? Can you check it?

Engager

Yes, I checked the logging. I see the error below, and the connection between the members is fine.

07-12-2019 14:50:57.458 -0400 ERROR ConfReplicationThread - Error pushing configurations to captain=https://hesplsrhc004:8089, consecutiveErrors=2333 msg="Error in acceptPush: Non-200 status_code=400: ConfReplicationException: Cannot accept push with outdated_baseline_op_id=3dfc93bbf15bcbb2d0c2c8b69d542d7d05181bb2; current_baseline_op_id=5d0509452c20f0c738813010a053ae57e4aefb64": Search head clustering: Search head cluster member (https://hesplsrhc002:8089) is having problems pushing configurations to the search head cluster captain (https://hesplsrhc004:8089). Changes on this member are not replicating to other members.

Influencer

Got ya. That message is much more informative. Your SHC members need to inform the captain of the changes they make, so it can replicate them to the remaining members. The problem is that what your member is pushing is based on a baseline too far behind what the captain has. So you need to ensure there is a common baseline across all of them, meaning you need to resync them.

I'd start with splunk show shcluster-status, check last_conf_replication on the members that are not the captain, and compare against the captain. A manual resync of the out-of-date members should then be done so they share a common commit.

Follow the doc here: https://docs.splunk.com/Documentation/Splunk/7.3.0/DistSearch/HowconfrepoworksinSHC#Perform_a_manual...

Engager

Thank you for your reply. I did the manual resync by running "splunk resync shcluster-replicated-config", but nothing has changed. I have run this command over the last two days, but to no avail.

The last replication for all the members is last_conf_replication : Fri Jul 12 17:11:48 2019, which I think is not an issue.

Engager

I got the following error for one of the members:

Downloaded an old snapshot created 91696 seconds ago; Check for clock skew on this member or the captain; If no clock skew is found, check the captain for possible snapshot creation failures

Influencer

There's a parameter in server.conf controlling when replicated changes are purged: conf_replication_purge.eligibile_age. Its default is one day (86400 seconds).
What do you mean by "I have ran this command from two days"?

Engager

I mean I ran the manual resync command two days ago and again yesterday, but I still see the error.

Coming to the old snapshot issue: what changes can I make so that the resync picks up the latest snapshot?

Influencer

You need to run that command on the member that has the issue, and you should see "The member has been synced to the latest replicated configurations on the captain."
Is that what you've done? Did you run the resync command on the member with the issue?

Engager

Yes, I ran it on the search head that had the issue. Once it showed it had synced with the latest replication, but at other times it shows the clock-skew error. I raised a ticket with Splunk support.
