Hello,
We are encountering an issue after a data migration.
The data migration was needed to increase disk performance.
Basically we moved all the Splunk data from disk1 to disk2 on a single Splunk Indexer instance belonging to a Multi-Site Splunk Indexer Cluster.
The procedure was:
Once we restarted Splunk, some buckets were marked as DISABLED.
This happened because when we stopped Splunk at step 2, the hot buckets rolled to warm (on disk1).
Therefore, during the rsync at step 4, those freshly rolled warm buckets on disk1 were copied to disk2, where hot buckets with the same IDs were already present. This caused the conflict, and the buckets were marked as DISABLED.
So now the DISABLED buckets may contain more data (but not all of it) than the non-disabled ones. Furthermore, the non-disabled ones have been replicated within the cluster.
Do you think there is a way to recover those DISABLED buckets so that they will be searchable again?
I see here:
it seems the solution, if I understood correctly, could be (with the Splunk instance not running) to move the data from, for example, DISABLED-db_1631215114_1631070671_448_3C08D28D-299A-448E-BD23-C0E9B071E694 to db_1631215114_1631070671_<newBucketID>_3C08D28D-299A-448E-BD23-C0E9B071E694
If so:
Here is what I find in the internal logs when checking for one of the affected buckets:
Query:
index=_internal *1631215114_1631070671_448_3C08D28D-299A-448E-BD23-C0E9B071E694 source!="/opt/splunk/var/log/splunk/splunkd_ui_access.log" source!="/opt/splunk/var/log/splunk/remote_searches.log" | sort -_time
Result:
09-09-2021 14:18:41.758 +0200 INFO HotBucketRoller - finished moving hot to warm bid=_internal~448~3C08D28D-299A-448E-BD23-C0E9B071E694 idx=_internal from=hot_v1_448 to=db_1631215114_1631070671_448_3C08D28D-299A-448E-BD23-C0E9B071E694 size=10475446272 caller=size_exceeded _maxHotBucketSize=10737418240 (10240MB,10GB), bucketSize=10878386176 (10374MB,10GB)
09-09-2021 14:18:41.767 +0200 INFO S2SFileReceiver - event=rename bid=_internal~448~3C08D28D-299A-448E-BD23-C0E9B071E694 from=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/448_3C08D28D-299A-448E-BD23-C0E9B071E694 to=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/rb_1631215114_1631070671_448_3C08D28D-299A-448E-BD23-C0E9B071E694
09-09-2021 14:18:41.795 +0200 INFO S2SFileReceiver - event=rename bid=_internal~448~3C08D28D-299A-448E-BD23-C0E9B071E694 from=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/448_3C08D28D-299A-448E-BD23-C0E9B071E694 to=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/rb_1631215114_1631070671_448_3C08D28D-299A-448E-BD23-C0E9B071E694
09-09-2021 14:18:41.817 +0200 INFO S2SFileReceiver - event=rename bid=_internal~448~3C08D28D-299A-448E-BD23-C0E9B071E694 from=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/448_3C08D28D-299A-448E-BD23-C0E9B071E694 to=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/rb_1631215114_1631070671_448_3C08D28D-299A-448E-BD23-C0E9B071E694
09-09-2021 15:53:19.476 +0200 INFO DatabaseDirectoryManager - Dealing with the conflict bucket="/products/data/xxxxxxxxx/splunk/db/_internaldb/db/db_1631215114_1631070671_448_3C08D28D-299A-448E-BD23-C0E9B071E694"...
09-09-2021 15:53:19.477 +0200 ERROR DatabaseDirectoryManager - Detecting bucket ID conflicts: idx=_internal, bid=_internal~448~3C08D28D-299A-448E-BD23-C0E9B071E694, path1=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/hot_v1_448, path2=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/db_1631215114_1631070671_448_3C08D28D-299A-448E-BD23-C0E9B071E694. Temporally resolved by disabling the bucket: path=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/DISABLED-db_1631215114_1631070671_448_3C08D28D-299A-448E-BD23-C0E9B071E694. Please check this disabled bucket for manual removal.\nDetecting bucket ID conflicts: idx=_internal, bid=_internal~595~E17D5544-7169-4D32-B7C0-3FD972956D4B, path1=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/595_E17D5544-7169-4D32-B7C0-3FD972956D4B, path2=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/rb_1628215818_1627992904_595_E17D5544-7169-4D32-B7C0-3FD972956D4B. Temporally resolved by disabling the bucket: path=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/DISABLED-rb_1628215818_1627992904_595_E17D5544-7169-4D32-B7C0-3FD972956D4B. Please check this disabled bucket for manual removal.\nDetecting bucket ID conflicts: idx=_internal, bid=_internal~591~12531CC6-0C79-473A-859E-9ADF617941A2, path1=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/591_12531CC6-0C79-473A-859E-9ADF617941A2, path2=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/rb_1628647804_1628215848_591_12531CC6-0C79-473A-859E-9ADF617941A2. Temporally resolved by disabling the bucket: path=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/DISABLED-rb_1628647804_1628215848_591_12531CC6-0C79-473A-859E-9ADF617941A2. 
Please check this disabled bucket for manual removal.\nDetecting bucket ID conflicts: idx=_internal, bid=_internal~606~1D0FBF00-A5FF-4767-A044-F3C6F01BAD84, path1=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/606_1D0FBF00-A5FF-4767-A044-F3C6F01BAD84, path2=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/rb_1630204023_1629772040_606_1D0FBF00-A5FF-4767-A044-F3C6F01BAD84. Temporally resolved by disabling the bucket: path=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/DISABLED-rb_1630204023_1629772040_606_1D0FBF00-A5FF-4767-A044-F3C6F01BAD84. Please check this disabled bucket for manual removal.\nDetecting bucket ID conflicts: idx=_internal, bid=_internal~603~1D0FBF00-A5FF-4767-A044-F3C6F01BAD84, path1=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/603_1D0FBF00-A5FF-4767-A044-F3C6F01BAD84, path2=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/rb_1631172918_1631063432_603_1D0FBF00-A5FF-4767-A044-F3C6F01BAD84. Temporally resolved by disabling the bucket: path=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/DISABLED-rb_1631172918_1631063432_603_1D0FBF00-A5FF-4767-A044-F3C6F01BAD84. Please check this disabled bucket for manual removal.\nDetecting bucket ID conflicts: idx=_internal, bid=_internal~436~2E5A3717-4C0C-487C-87D3-A7127B3DB42D, path1=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/436_2E5A3717-4C0C-487C-87D3-A7127B3DB42D, path2=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/rb_1631196626_1631073242_436_2E5A3717-4C0C-487C-87D3-A7127B3DB42D. Temporally resolved by disabling the bucket: path=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/DISABLED-rb_1631196626_1631073242_436_2E5A3717-4C0C-487C-87D3-A7127B3DB42D. 
Please check this disabled bucket for manual removal.\nDetecting bucket ID conflicts: idx=_internal, bid=_internal~589~12531CC6-0C79-473A-859E-9ADF617941A2, path1=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/589_12531CC6-0C79-473A-859E-9ADF617941A2, path2=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/rb_1631199124_1630935298_589_12531CC6-0C79-473A-859E-9ADF617941A2. Temporally resolved by disabling the bucket: path=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/DISABLED-rb_1631199124_1630935298_589_12531CC6-0C79-473A-859E-9ADF617941A2. Please check this disabled bucket for manual removal.\nDetecting bucket ID conflicts: idx=_internal, bid=_internal~594~E17D5544-7169-4D32-B7C0-3FD972956D4B, path1=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/594_E17D5544-7169-4D32-B7C0-3FD972956D4B, path2=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/rb_1631215283_1630935291_594_E17D5544-7169-4D32-B7C0-3FD972956D4B. Temporally resolved by disabling the bucket: path=/products/data/xxxxxxxxx/splunk/db/_internaldb/db/DISABLED-rb_1631215283_1630935291_594_E17D5544-7169-4D32-B7C0-3FD972956D4B. Please check this disabled bucket for manual removal.\n
Thanks a lot,
Edoardo
@isoutamo thanks for your feedback. We couldn't use any storage-level suite because we changed the number of LUNs seen by the OS (to increase performance). We tested moving the data via the Splunk Cluster's built-in bucket replication, but moving 8 TB per indexer was taking too long, so we went with rsync.
I have written 2 guides:
Here the details:
How to rsync data on an Indexer from one disk to another disk
Rsync procedure to move Splunk data from OLD to NEW disk
Example:
Splunk data are stored here SPLUNK_DB=/products/data/splunk
disk1 is mounted as /products/data/splunk
disk2 is mounted as /products/data/splunk2
script_01.sh performs the first big copy (and can be executed with Splunk running)
script_02.sh performs the final delta copy (and has to be executed with Splunk NOT running and with the cluster in maintenance mode)
1- On Splunk Indexer create the 2 scripts
in /opt/splunk/
script_01.sh
#!/bin/bash
time rsync -aP /products/data/splunk/ /products/data/splunk2/
script_02.sh
#!/bin/bash
time rsync -aP --delete /products/data/splunk/ /products/data/splunk2/
2- Procedure to run the first script (make both scripts executable first: chmod +x /opt/splunk/script_0?.sh)
#run script 01 without stopping splunk for the first sync
nohup /opt/splunk/script_01.sh 2>&1 &
3-Procedure to run second script
#Once the first script has finished, verify with
ps -ef | grep rsync
check nohup.out
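The two manual checks above (no rsync process left, nothing new in nohup.out) can be wrapped in a small wait loop; a minimal sketch, assuming `pgrep` is available on the indexer:

```shell
# Wait until no rsync process remains, then report; the 60s poll interval
# is an arbitrary choice for a copy that runs for hours.
while pgrep -x rsync > /dev/null; do
  sleep 60
done
status="rsync finished"
echo "$status"
# Show the last lines of the rsync log, if present
tail -n 3 /opt/splunk/nohup.out 2>/dev/null || true
```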
#check file system (space usage on disk2 should not increase anymore)
df -h
df -hm
#put the cluster in maintenance mode on Master Node
splunk enable maintenance-mode
splunk show maintenance-mode
splunk show cluster-status --verbose | head -20
#stop splunk on Indexer
./splunk stop
#run second script to perform the last copy in delta
nohup /opt/splunk/script_02.sh 2>&1 &
#Once the second script has finished, verify with
ps -ef | grep rsync
check nohup.out
#check file system
df -h
df -hm
# Correct procedure to check filesystem01 is in sync with filesystem02
## list files and save the output
find /products/data/splunk/ -print > /opt/splunk/file_disk_01.txt
find /products/data/splunk2/ -print > /opt/splunk/file_disk_02.txt
## modify absolute path to then perform the comparison
vi file_disk_02.txt
:%s/splunk2/splunk/g
:wq
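The interactive vi edit can also be done non-interactively with sed; a sketch on a demo file (the temp paths below are stand-ins for the real listings):

```shell
tmp=$(mktemp -d)
# Demo listing standing in for file_disk_02.txt
printf '%s\n' /products/data/splunk2/db/_internaldb/db/bucketA > "$tmp/file_disk_02.txt"
# Rewrite the disk2 prefix so the listing compares against disk1 paths
sed 's|/products/data/splunk2/|/products/data/splunk/|' \
    "$tmp/file_disk_02.txt" > "$tmp/file_disk_02_fixed.txt"
cat "$tmp/file_disk_02_fixed.txt"
```

Anchoring the full `/products/data/splunk2/` prefix (rather than the bare string `splunk2`) avoids accidentally rewriting a file name that happens to contain "splunk2".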
## sort the file to speed-up the diff command
sort file_disk_01.txt > file_disk_01_sorted.txt
sort file_disk_02.txt > file_disk_02_sorted.txt
## perform the diff (should not give any result if files are equal between disk1 and disk2)
diff file_disk_01_sorted.txt file_disk_02_sorted.txt
## also perform a md5 check (should give same hash)
md5sum file_disk_01_sorted.txt
md5sum file_disk_02_sorted.txt
splunk@myindexer:~ > md5sum file_disk_01_sorted.txt
31xx87c049fxx24b353xx8c45xx1b198 file_disk_01_sorted.txt
splunk@myindexer:~ > md5sum file_disk_02_sorted.txt
31xx87c049fxx24b353xx8c45xx1b198 file_disk_02_sorted.txt
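Instead of comparing the two hashes by eye, the check can be scripted; a sketch using demo files in place of the real sorted listings:

```shell
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/file_disk_01_sorted.txt"
printf 'a\nb\n' > "$tmp/file_disk_02_sorted.txt"
h1=$(md5sum "$tmp/file_disk_01_sorted.txt" | cut -d' ' -f1)
h2=$(md5sum "$tmp/file_disk_02_sorted.txt" | cut -d' ' -f1)
# Identical hashes mean the (path-normalized) listings match exactly
if [ "$h1" = "$h2" ]; then result="IN SYNC"; else result="MISMATCH"; fi
echo "$result"
```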
#Before switching disks and restarting Splunk, also check whether some DISABLED buckets already exist (to avoid confusion if new ones appear after the restart)
ls -la /products/data/splunk/db/*/db/DISABLED* | grep products
ls -la /products/data/splunk2/db/*/db/DISABLED* | grep products
#switch disks (umount old disk, umount new disk, change the mount point in /etc/fstab, mount the new disk)
#restart splunk on Indexer
./splunk start
#check if DISABLED buckets are present (there should be none, unless some already existed before the switch)
ls -la /products/data/splunk/db/*/db/DISABLED* | grep products
#Check the Monitoring Console on the Master Node is OK
#Run some queries over the last 7 days on the Search Head
index=_internal splunk_server=myindexer
index=_internal splunk_server=myindexer | stats count
index!=_* index=* splunk_server=myindexer
index!=_* index=* splunk_server=myindexer | stats count
index=_internal
index!=_* index=*
index=_internal | stats count
index!=_* index=* | stats count
#disable maintenance mode on Master Node
splunk disable maintenance-mode
splunk show maintenance-mode
#Run the queries again over the last 7 days on the Search Head once the Replication Factor and Search Factor are met
How to recover DISABLED buckets
As indicated here, it seems up to 9 digits are available for the bucket ID:
https://community.splunk.com/t5/Deployment-Architecture/Max-Value-for-bucket-ID/m-p/76731
Example:
Splunk data are stored here SPLUNK_DB=/products/data/splunk
#Back up the DISABLED buckets
##list the folders
ls -la /products/data/splunk/db/*/db/DISABLED* | grep product
##create filelist.txt with the list of folders (remove any trailing ":" left over from the previous command's output)
tar -cvf /products/data/splunk/bckDisabled/bckbucket.tar -T /products/data/splunk/bckDisabled/filelist.txt
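A hedged sketch of building filelist.txt and the tar backup without manual editing: `ls -d` on the glob emits one clean path per line (no trailing ":"), ready for `tar -T`. The temp directory below stands in for /products/data/splunk:

```shell
tmp=$(mktemp -d)   # stand-in for /products/data/splunk
mkdir -p "$tmp/db/audit/db/DISABLED-db_1631192613_1630932002_105_GUID" "$tmp/bckDisabled"
# One clean directory path per line, no ":" suffix to strip
ls -d "$tmp"/db/*/db/DISABLED* > "$tmp/bckDisabled/filelist.txt"
tar -cf "$tmp/bckDisabled/bckbucket.tar" -T "$tmp/bckDisabled/filelist.txt"
tar -tf "$tmp/bckDisabled/bckbucket.tar"
```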
#Put cluster in maintenance mode on Master Node
splunk enable maintenance-mode
splunk show maintenance-mode
#Stop Splunk on your Indexer
./splunk stop
#Move the DISABLED folder to a non-disabled name, increasing the bucket ID (in this example from 105 to 100105)
mv /products/data/splunk/db/audit/db/DISABLED-db_1631192613_1630932002_105_3C08D28D-299A-448E-BD23-C0E9B071E694 /products/data/splunk/db/audit/db/db_1631192613_1630932002_100105_3C08D28D-299A-448E-BD23-C0E9B071E694
Note: since buckets are usually deleted once the retention time is reached, the new bucket ID has to be high enough that it will never be reached naturally in your environment.
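When many buckets are affected, the rename can be scripted; a sketch of the ID-bump logic on a demo directory (the temp dir stands in for an index's db/ directory; dry-run with `echo mv` before touching real data):

```shell
demo=$(mktemp -d)   # stand-in for an index's db/ directory
mkdir "$demo/DISABLED-db_1631192613_1630932002_105_3C08D28D-299A-448E-BD23-C0E9B071E694"

for d in "$demo"/DISABLED-*; do
  name=$(basename "$d"); name=${name#DISABLED-}    # db_<latest>_<earliest>_<id>_<guid>
  id=$(echo "$name" | awk -F_ '{print $(NF-1)}')   # 4th field is the bucket ID
  newname=$(echo "$name" | sed "s/_${id}_/_$((id + 100000))_/")
  mv "$d" "$demo/$newname"                         # bump the ID by 100000
done
ls "$demo"
```

The +100000 offset is the same idea as the 105 → 100105 example above; pick an offset high enough that normal bucket IDs will never collide with the renamed ones.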
#Restart Splunk on your Indexer
./splunk start
#Check data are searchable
index=_audit earliest=1630932002 latest=1631192613 splunk_server=myindexer
#Check the bucket is searchable with a REST call
https://yourmasternode:8089/services/cluster/master/buckets/_audit~100105~3C08D28D-299A-448E-BD23-C0E9B071E694
you should see the bucket searchable on your indexer
#Disable maintenance mode and check the bucket gets replicated (once the Search Factor and Replication Factor are met)
Check with REST
https://yourmasternode:8089/services/cluster/master/buckets/_audit~100105~3C08D28D-299A-448E-BD23-C0E9B071E694
you should see the bucket searchable on more than one indexer (if you are in an Indexer Cluster), according to your SF and RF
Note: if you have DISABLED buckets in an Indexer Cluster, they can be both db_* and rb_*.
If you recover them all you could end up with duplicated data, but duplicates are better than nothing.
Hope these guides help you plan a data migration or resolve bucket conflicts.
Best Regards,
Edoardo
Hi
First, I haven't done this fix on a (multisite) cluster.
Basically, your analysis was quite correct. And how should you do it next time to avoid this?
The easiest way would be to use LVM on Linux: just add new disks to the VG and then extend the needed filesystem. IMHO, never run Splunk without LVM! Also, use Splunk's volumes instead of pointing indexes directly at SPLUNK_DB/xxx. With those two you can avoid a lot of issues.
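For reference, the LVM route described above would look roughly like this (the device, VG, and LV names are assumptions; this is an ops sketch requiring root, not a drop-in script):

```shell
# Add a new disk to the existing volume group and grow the filesystem online.
pvcreate /dev/sdc                                  # initialize the new LUN/disk
vgextend vg_splunk /dev/sdc                        # add it to the volume group
lvextend -r -l +100%FREE /dev/vg_splunk/lv_splunk  # grow LV; -r resizes the filesystem too
```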
Another way would be to add a new node with the above disk configuration, join it to the cluster, and then remove the old node.
Last one was:
Then to your questions.
I can't say for sure how many digits there can be (and this actually changes between versions, if I'm right), but at least 5-6 digits should be OK. I'm not sure whether, after the Splunk restart, your changed bucket ID becomes the new starting point for bucket IDs on the multisite cluster. What I have seen is that the same numeric bucket ID can exist on several nodes, but since the full bucket ID also contains the node GUID etc., having the same number on several nodes is not an issue.
As I said, not in production clusters; I have played with this in a sandbox.
I'm quite sure Splunk didn't replicate those buckets, as they are "old" buckets. Splunk starts replication when new buckets are created, not for old ones.
r. Ismo