Has anyone attempted clustering (v5.0) on indexes with existing data?


Firstly, yes I have read this ( ).

Disclaimer: the official Splunk corporate line on this is "There is no supported procedure for this conversion. If you are interested in having this performed, please contact Splunk Professional Services to discuss the trade-offs and requirements for this operation."

So, with that out of the way, has anyone actually attempted to cluster indexes with existing data?

How have you got around the issue of initial replication of existing data across multiple indexers?

The idea in my head is that I would have to manually sync all my indexers with the existing data prior to enabling the cluster. This way each indexer would have a standalone copy of the historical data. So the standard procedure (copy the buckets across, renumber the buckets, etc.) would need to be done.

Then, when the cluster is enabled, it creates new buckets with the long bucket names, which are then replicated around. My theory is that if any of these indexers goes down, the others still have their standalone copy of the historical data AND the replicated new data to refer to, and thus an entire dataset of new and historical data. The limitation is that every indexer would hold exactly one copy of the historical data, i.e. the historical data effectively has a replication factor equal to the total number of nodes, and any replication setting lower than that is impossible with this method.

Interested to know: 1. whether you've done this, and 2. what issues you encountered.

The other thing I have just thought of is that copying the buckets around manually would leave duplicated data within each index.

Super Champion

So this question is old. Lucas, not sure if you came up with a solution or not, but here's some additional info:

First off, you don't want to simply replicate existing buckets as is. Duplicating your buckets by hand will result in Splunk seeing the data twice. Splunk has no way of knowing that the same bucket exists in two places, so it will treat both (or all) copies as pre-clustered buckets and therefore search all copies. So not only do you have the overhead of searching the same data multiple times, but now you'll need some sort of "dedup" or other clever way to eliminate duplicates in your searches. Not fun.

It's pretty easy to trick Splunk into converting non-clustered buckets into clustered ones. If you've taken a look at the bucket folder naming, you'll pretty quickly see the difference between the names of clustered and non-clustered buckets. The biggest difference is that clustered buckets have the GUID as part of the name, which indicates which server the bucket originated on. Keep in mind that the cluster master is essentially stateless between restarts, so everything it knows about the cluster is gleaned during the initialization phase; this means that you can pretty easily trick Splunk into thinking that a non-clustered bucket is a single-site clustered bucket. (Multi-site clustering is a completely different beast in this regard. Splunk made it much more difficult to pull off this kind of trick at a multi-site level.)
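To make the naming difference concrete, here is a minimal sketch that classifies a bucket directory by its name alone. It assumes the non-clustered form is db_<number>_<number>_<number> (the same pattern the conversion command in this answer searches for) and that the clustered form is that name plus an underscore and a GUID made up of hex digits and dashes. The function name bucket_kind is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: classify a bucket directory by its name only.
# Assumption: the GUID suffix consists of hex digits and dashes.
bucket_kind() {
    name=$(basename "$1")
    if printf '%s\n' "$name" | grep -Eq '^db_[0-9]+_[0-9]+_[0-9]+$'; then
        echo "non-clustered"
    elif printf '%s\n' "$name" | grep -Eq '^db_[0-9]+_[0-9]+_[0-9]+_[0-9A-Fa-f-]+$'; then
        echo "clustered"
    else
        echo "other"
    fi
}
```

Calling bucket_kind on each directory under an index's db folder gives a quick inventory of which buckets would still need converting. It only looks at names and does not inspect bucket contents.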

Bucket conversion itself can be done using something like this:

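# Read this indexer's GUID from instance.cfg (expects a line of the form "guid = <value>").
GUID=$(grep '^guid' "$SPLUNK_HOME/etc/instance.cfg" | tr -d ' ' | cut -d'=' -f2)
# Append the GUID to every bucket directory that still has a non-clustered name.
find "$SPLUNK_DB" -type d -regex '.*/db_[0-9]+_[0-9]+_[0-9]+' | while IFS= read -r bkt; do mv -v "$bkt" "${bkt}_${GUID}"; done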

Use this at your own risk! Only run this on the indexes you want to replicate. Copy a small number of buckets over to a test server and do this all in a non-production cluster before you attempt this on prod, and so on. Also keep in mind that there are additional conversion steps required beyond this, but this is the one bit in particular that's not really documented.
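In the spirit of testing first, a dry run can list what the rename would do before anything is moved. The sketch below is only an illustration: the function name plan_bucket_renames and its two arguments (directory to scan, GUID to append) are assumptions for this example, and it prints mv commands rather than running them.

```shell
#!/bin/sh
# Dry run: print the renames the conversion would perform, without touching anything.
# Arguments: $1 = directory to scan, $2 = GUID to append (both hypothetical parameters).
plan_bucket_renames() {
    find "$1" -type d -name 'db_*' |
    grep -E '/db_[0-9]+_[0-9]+_[0-9]+$' |
    sort |
    while IFS= read -r bkt; do
        # Only directories whose names still end in the numeric id reach this point.
        printf 'mv %s %s_%s\n' "$bkt" "$bkt" "$2"
    done
}
```

Because the pattern only matches names that end in the numeric bucket id, directories that already carry a GUID suffix are left out of the plan, so the output also shows how much is left to convert after a partial run.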

Once your buckets are renamed, I suppose you could pre-replicate them out to a secondary node. However, each copied bucket should be renamed from "db_" to "rb_" (indicating that it's a replicated bucket). Where the replicated bucket should end up depends on which version of Splunk you're running (5.x vs 6.x). And of course, if you already have multiple indexers, trying to pre-share this data gets a lot more complicated and probably isn't worth it. (In Splunk 6 the bucket replication is actually more efficient because it tries to copy both the raw and index data at once, whereas in 5.x Splunk would only copy the raw data and then require the indexes to be rebuilt on the destination server, which consumed considerably more resources.)

And again, if any of this seems difficult or confusing to you, and you value your data, please contact an expert. There's Splunk PS, and lots of Splunk partners who are qualified to help with this kind of conversion. There are also a number of pros and cons to consider with clustering in general, which are good to talk through with someone who's had experience maintaining a cluster.

For full disclosure, I work for a Splunk partner.
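The "db_" to "rb_" rename is a simple prefix swap on the directory name. A small sketch, with the function name replicated_bucket_name invented for this example; it deliberately says nothing about where the copy should be placed, since that depends on the Splunk version as noted above.

```shell
#!/bin/sh
# Hypothetical helper: derive the replicated-copy name for a bucket directory name.
# Only the leading "db_" is changed to "rb_"; the rest of the name is kept as is.
replicated_bucket_name() {
    case "$1" in
        db_*) printf 'rb_%s\n' "${1#db_}" ;;
        *)    printf 'not a db_ bucket name: %s\n' "$1" >&2; return 1 ;;
    esac
}
```

The function takes the bare directory name, not a full path, and returns a non-zero status for anything that does not start with "db_", so a copy script can stop instead of guessing.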


le bump

So clustering has been around for quite a while now and I still haven't heard of a method to migrate from non-clustered to clustered data, with replication of historic data, from either the answers here OR Professional Services 😕

My manager has asked how I can do this (REGARDLESS OF PERFORMANCE PENALTY) as it is the only protection against data loss for the volume of data we have.

Might have to see if I can figure out how to trick Splunk into reindexing the raw data and putting it into the clustered indexes.


Splunk Employee

By default, all your pre-5.0 data is immediately searchable after enabling clustering, but it's not replicated. The reason is that we don't want to saturate your network with replication activity when you enable clustering.

One workaround is to let the old data age out. Generally, users retain data in Splunk for 90 days and then it ages out. So technically you have a single copy of the old data for 90 days, while all new 5.0 data will have the required number of replicated copies.



I realise this.

But what if I DO want it replicated? As far as I know, there is no way to achieve this without reindexing the original data.

Use case: I have multiple TB worth of historic data that IS used for long-term diagnostics and capacity reasons.

Currently this data is spread across multiple indexers.

If any of them dies, we won't have a full copy to search across. The customer saw clustering as the "saviour" against the possibility of losing data, but it will only protect data from the moment replication is turned on.
