Deployment Architecture

Which tool should I use to back up indexed data efficiently?

Contributor

Hello,

I would like to backup the indexed data on a remote server. I tried rsync:

rsync -aAXv -R -e ssh --exclude='/opt/splunk/var/lib/splunk/*/db/hot_v1_*' /opt/splunk/var/lib/splunk/INDEX_TEST/ backup@remote_ip:/backup/splunk_db

This command works and excludes the hot directories.
But when warm buckets are rolled to cold, rsync copies the data again: it treats the moved buckets as new data, even though they have only moved to a different folder (colddb). Is there another tool that can do an incremental copy of the indexed data while keeping track of file status?

Thank you.

1 Solution

SplunkTrust

I can't comment on specific tools, but in terms of what you need to back up: to avoid duplicates, you need to ensure only your warm buckets are backed up. On Windows I used robocopy and configured it to back up only the buckets within the hot/warm directory that follow the warm bucket naming convention. See this article here.

All your data moves from hot to warm and then to cold, so if you back up the warm buckets you should be fine.
The only problem with this approach is that you won't have a backup of your hot buckets. If some of your sources don't generate much data, their buckets might only roll to warm after days or even weeks. In that case, although it is not good practice, you can force a hot-to-warm bucket roll. Take a look at this for more information.


Builder

@ctaf you can roll hot to warm before backups.

It looks like you're running Linux, so a bash script like this would work:

# bash: loop through each index that currently has a hot bucket.
SPLUNK_DB=/opt/splunk/var/lib/splunk
for INDEX in $(find "$SPLUNK_DB" -name 'hot_v1_*' -type d | cut -d'/' -f7 | sort -u); do
    echo "Rolling hot buckets to warm for $INDEX."
    # Discard output, or redirect to a log file instead.
    /opt/splunk/bin/splunk _internal call /data/indexes/"$INDEX"/roll-hot-buckets >/dev/null 2>&1
    echo "Rolled over, rsyncing warm buckets to the backup host."
    rsync -aAXv -R -e ssh "$SPLUNK_DB/$INDEX"/db/db_* backup@remote_ip:/backup/splunk_db/"$INDEX"
done
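If you want to run a roll-and-backup script like this on a schedule, a crontab entry could drive it. A sketch, assuming the script is saved as /opt/splunk/scripts/roll_and_backup.sh (that path is hypothetical):

```shell
# Hypothetical crontab entry: run the roll-and-backup script at the top of
# every hour, appending output to a log file for troubleshooting.
0 * * * * /opt/splunk/scripts/roll_and_backup.sh >> /var/log/splunk_backup.log 2>&1
```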


Contributor

Why do you recommend not backing up the cold buckets?


Splunk Employee

Because you already backed them up when they were warm. Warm rolls to cold, and the state of a bucket should never change after it rolls to warm.


Contributor

OK, but if I restore a bucket that was in the cold state into the hot folder, what will happen? Will Splunk automatically roll it to the cold folder?

Moreover, the backup runs something like every hour. So if, between two backups, there is new data in a hot bucket and it is rolled to cold, I will lose some data by monitoring only the hot folder...


Splunk Employee

You shouldn't be backing up hot buckets, unless you have a good way to diff your backup against the newest data. If you are using clustering, I wouldn't worry about backing up hot buckets at all.

Otherwise, if you want to run backups against your hot/warm data, you should manually roll your hot buckets before the backup. This will create a lot of disk I/O, however.


Contributor

Oh, I meant warm buckets.

"What if, between two backups, there is new data in a WARM bucket and it is rolled to cold? I will lose some data by monitoring only the hot/warm folder..."

And:
"What if I restore a bucket that was in the cold state into the hot/warm folder? Will Splunk automatically roll it to the cold folder?"


SplunkTrust

Warm buckets are read-only.
There shouldn't be any difference between a warm bucket and the corresponding cold bucket, just its location within your directory structure.

If you restore a warm bucket into cold, Splunk will search it in cold; if you restore a cold bucket into warm, Splunk will search it in warm.
Even easier: if you restore everything into warm, Splunk will then roll whatever is needed to cold automatically, based on your retention policies.

Splunk Employee

Have you read this link? : http://docs.splunk.com/Documentation/Splunk/6.2.0/Indexer/Backupindexeddata

HOT and WARM buckets are different folders on the same volume.

Warm should never have new data in it. Buckets are only writable while they are in the HOT state. Once a bucket has rolled to warm, its contents should never change. The only time this might happen is if you kick off a backup during a hot-to-warm roll.

If you restore a cold bucket, you should restore it into the thaweddb path. Otherwise, when the buckets are in WARM, they will be read as standard buckets and Splunk will apply the defined retention policy to them.
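To illustrate the thaweddb route: a hypothetical restore of one archived bucket. The index name, bucket name, and backup path below are made up; `splunk rebuild` regenerates the bucket's index files so it becomes searchable after a restart:

```shell
# Hypothetical restore of a single cold bucket into thaweddb, where
# Splunk's retention policies do not apply. Names and paths are illustrative.
INDEX=INDEX_TEST
BUCKET=db_1389230491_1389230488_5
cp -r /backup/splunk_db/"$INDEX"/"$BUCKET" \
      /opt/splunk/var/lib/splunk/"$INDEX"/thaweddb/
/opt/splunk/bin/splunk rebuild /opt/splunk/var/lib/splunk/"$INDEX"/thaweddb/"$BUCKET"
/opt/splunk/bin/splunk restart
```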
