In the 5.0 release, rolling-restart, apply, and "rolling offline" - i.e., offlining peers one at a time in sequence - are all not search-safe. Updating the configuration cluster-wide via apply really does behave like a "maintenance mode": data is safe, but it may not be searchable during the rolling restart. After the rolling restart completes, the cluster should be searchable again (I believe the master commits a new generation at that point). The docs don't seem to state this explicitly; I'll try to get them updated.
Also, we are working to fix the limitations detailed below.
To explain what is going on a bit more:
Every peer is potentially both the source and the target of ongoing hot bucket replications: it originates some hot buckets that are replicated to other peers, and it is the target (and potentially the searchable target; this is the problematic case) for hot buckets originating on other peers. Each peer is also the primary for the hot buckets it originates. When we offline a peer - say peer A - it cleanly rolls the hot buckets it originates and transfers primary responsibility for those hot buckets (along with any other warm buckets it is primary for) to other peers. It doesn't worry about any hot bucket - say bucket B1 - for which it is the searchable streaming target and which originates on some other peer, since that source is still up and is still responsible for searching that bucket. So offlining one peer works by fixing up the hot buckets it originates and not worrying about the hot buckets it is receiving. For a rolling restart, though, those received buckets do come into the picture.
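To make those roles a bit more concrete, here is a rough Python sketch of the bookkeeping involved. The names and structure are made up purely for illustration - this is not the actual peer/master implementation, just a model of the behavior described above:

```python
# Illustrative model only - class and field names are invented for this example,
# not Splunk's actual implementation.

class Bucket:
    def __init__(self, name, source, targets, searchable_on):
        self.name = name
        self.source = source                      # peer that originates (indexes) this hot bucket
        self.targets = set(targets)               # peers receiving streamed replicas
        self.searchable_on = set(searchable_on)   # peers holding searchable copies
        self.primary = source                     # the originator is primary for its own hot buckets
        self.hot = True

def offline_peer(peer, buckets, peers_up):
    """Model of cleanly taking one peer offline."""
    peers_up.discard(peer)
    for b in buckets:
        if b.source == peer and b.hot:
            b.hot = False                         # the peer rolls the hot buckets it originates
        if b.primary == peer:
            # hand primacy to some other peer that has a copy and is still up
            candidates = (b.targets | {b.source}) & peers_up
            b.primary = next(iter(candidates), None)
        # Buckets where `peer` is only a replication target (like B1) are left
        # alone - their source is still up and still searches them.

peers_up = {"A", "B", "C"}
# B1: originated on B, streamed to A (its searchable target) and C.
b1 = Bucket("B1", source="B", targets=["A", "C"], searchable_on=["B", "A"])
# B2: originated on A itself, streamed to B.
b2 = Bucket("B2", source="A", targets=["B"], searchable_on=["A", "B"])

offline_peer("A", [b1, b2], peers_up)
print(b1.name, b1.hot, b1.primary)   # B1 True B  -> untouched; its source B still handles it
print(b2.name, b2.hot, b2.primary)   # B2 False B -> rolled, primacy handed off to B
```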
Now when peer A comes back, its copy of bucket B1 might be invalid. In the Ace release, we don't fix up the bucket mid-stream - i.e., catch it up on the data that has already been indexed while also keeping track of the data still flowing into it - and we can't fix up the search metadata files mid-stream either. Instead, the source rolls the bucket at that point. The copy on the peer that just restarted is likely invalid, so it is discarded and the master fixes up the bucket. If the discarded copy was a searchable copy, another copy has to be made searchable, and that can take a while depending on the size of the bucket. During this time, with SF=2, the source of B1 is the only peer with a valid searchable copy of B1. If the source of B1 also goes offline, then there is no searchable copy of the bucket online while the source is restarting. (Another copy is being made searchable, but it may not have finished yet, and the source, which has the only complete searchable copy, has gone offline.) So: data is not lost, but there may be no searchable copy online at that point.
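Here is the same kind of illustrative sketch for that SF=2 timeline (again, invented names and a simplified model, not actual internals):

```python
# Illustrative timeline only - invented for this example, not actual Splunk internals.

# Copies of hot bucket B1 with SF=2: the source B and peer A hold searchable copies.
# Per-peer copy state: "searchable", "non_searchable", "building", or "discarded".
copies = {"B": "searchable", "A": "searchable", "C": "non_searchable"}
online = {"A", "B", "C"}

def searchable_online(copies, online):
    return [p for p, state in copies.items() if state == "searchable" and p in online]

# 1. Peer A restarts: its streamed copy of B1 is likely invalid and gets discarded.
online.discard("A"); copies["A"] = "discarded"
online.add("A")                     # A comes back, but without a valid copy of B1

# 2. The master reacts by having C make its copy searchable (this takes time).
copies["C"] = "building"
print(searchable_online(copies, online))   # ['B'] - the source holds the only searchable copy

# 3. The rolling restart now takes B (the source) offline before C has finished.
online.discard("B")
print(searchable_online(copies, online))   # [] - no searchable copy online; data still safe on disk
```

The gap between steps 2 and 3 is exactly the window where there is no complete searchable copy of B1 online, even though no data has been lost.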
Since in a cluster every peer is likely the searchable target for some bucket, and every peer is going to go offline at some point, the above situation is likely true for one or more buckets throughout the rolling restart. So the cluster itself won't be search-safe for the duration of the rolling restart process.
Hope that helps explain what is going on. If you have more questions, ask away. And hopefully updating the config cluster-wide is infrequent enough that you can treat it as downtime for searches. We are working to fix this going forward.