How to reduce hot and cold volume safely?

m_zandinia
Path Finder

Hi Splunkers

I have 4 indexers in my cluster, with a total of 8 TB for the hot volume, 4 TB for summaries, and 160 TB for cold buckets. I don't have a frozen path configured.

This is my indexes.conf:

[volume:HOT]
path = /Splunk-Storage/hot
# ~2 TB
maxVolumeDataSizeMB = 1900000

[volume:COLD]
path = /Splunk-Storage/cold
# ~40 TB
maxVolumeDataSizeMB = 40000000

[volume:_splunk_summaries]
path = /Splunk-Storage/splunk_summaries
# ~1 TB
maxVolumeDataSizeMB = 950000

Now, I want to add 4 indexers but I can't increase my volumes due to some limitations. So, I have to split my total space between 8 indexers instead of 4 indexers.
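As a rough sanity check on the split described above, the per-indexer limits can be worked out with a small Python sketch. The helper name `per_indexer_mb`, the 1 TB = 1,000,000 MB convention, and the `headroom_pct` parameter are assumptions for illustration; only the totals come from the post. The 5% headroom is a guess that happens to reproduce the existing 1900000 and 950000 values.

```python
# Sketch only: split cluster-wide capacity evenly across indexers.
# Assumption: 1 TB is treated as 1,000,000 MB, as the posted values suggest.
MB_PER_TB = 1_000_000


def per_indexer_mb(total_tb: int, indexers: int, headroom_pct: int = 0) -> int:
    """Return a candidate maxVolumeDataSizeMB for one indexer.

    headroom_pct is the percentage of the disk deliberately left unused
    (hypothetical parameter, not a Splunk setting).
    """
    if indexers <= 0:
        raise ValueError("indexers must be positive")
    if not 0 <= headroom_pct < 100:
        raise ValueError("headroom_pct must be in [0, 100)")
    total_mb = total_tb * MB_PER_TB
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_mb * (100 - headroom_pct) // (100 * indexers)


# Totals quoted in the post: (total TB, assumed headroom percent).
volumes = {"HOT": (8, 5), "COLD": (160, 0), "_splunk_summaries": (4, 5)}

for name, (total_tb, headroom) in volumes.items():
    before = per_indexer_mb(total_tb, 4, headroom)
    after = per_indexer_mb(total_tb, 8, headroom)
    print(f"{name}: {before:,} MB on 4 indexers -> {after:,} MB on 8 indexers")
```

With the same total space, every per-indexer limit is simply halved when the cluster grows from 4 to 8 indexers.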

What do you suggest to avoid data loss and minimize downtime?


m_zandinia
Path Finder

Thanks for the replies.

I actually left out some facts.

First of all, my daily data ingestion is about 3-4 TB per day. Total events on a Wednesday, for example, are around 3,500,000,000.

There isn't any free space in the HOT and COLD volumes, and I can only temporarily get 80 TB for COLD and 4 TB for HOT. On top of that, the current COLD space is a single raw VMware disk file. Because of that, the infra team says they can't shrink the COLD space: the server must go down, and the file has to be destroyed and allocated again.

Let’s assume we have 4 indexers with these names: indexer1, indexer2, indexer3, indexer4, and want to add indexer5, indexer6, indexer7, and indexer8.

So, this is the procedure I have in mind:

1. Bring up indexer5 through indexer8, each with HOT=1TB and COLD=20TB from the temporary space

2. Bring down indexer1

3. Rebalance the data and wait for the cluster to stabilize

4. Bring indexer1 back up with HOT=1TB and COLD=20TB

5. Rebalance the data and wait for the cluster to stabilize

Repeat steps 2 to 5 for indexer2, indexer3, and indexer4 until all the indexers are up.

So the question is: at which step should I reconfigure my volumes (in indexes.conf) down to HOT=1TB and COLD=20TB to avoid data loss? And do you think using a frozen path is a good idea to avoid data loss?

PickleRick
SplunkTrust

It ain't that easy. Firstly, all cluster members should have the same specs (and indexes.conf is managed by the CM). If your current volumes are configured at 40T and you make the underlying storage just 20T, Splunk will not like that.

Another thing is that if you keep the cluster active (and I suppose you want to minimize downtime as well), it will keep indexing data, so it will not only roll existing buckets to frozen but also fill the space you've just freed. You could put the old indexers in detention and point your forwarders only at the new indexers, but that way you'll be rolling to frozen the buckets you've just replicated, due to lack of space. Generally speaking, Splunk is not meant to shrink.

And remember that while you're shutting down your last "old" indexer, you have no space to move its buckets to.

What are your RF/SF values? If it is 2, you could drop it to 1 temporarily (yes, it is risky), shrink the old indexers, add the new ones, and raise RF/SF back.

 

m_zandinia
Path Finder

Thanks

I do know this isn't gonna be easy!

RF=3 and SF=2.

There is another approach I am thinking of.

If I can acquire 160TB for COLD data, maybe this approach will work:

1. Reduce HOT to 1 TB (because I only have 4 TB of temporary space for HOT, so 2 TB per indexer will not work)

2. Let the cluster roll data from HOT to COLD

3. Bring up 4 new indexers with HOT=1TB and COLD=40TB

4. Rebalance the data to spread the COLD data across 8 indexers

5. Then reduce COLD to 20TB

This is much more like what @woodcock suggested. But there is the other problem I mentioned: the COLD area is just one VMware file, and the file itself can't be shrunk! Even if I decrease the volume size in Splunk and shrink the filesystem in Linux, the infra team still sees 40TB of used space, so after all of this I have to delete the COLD space on each indexer, one by one, and create it again with size=20TB!

Anyway, I suppose this is the safest approach. I appreciate your suggestion @PickleRick @woodcock 

PickleRick
SplunkTrust

Hot isn't that much of a problem because if you lower your hot/warm volume size, buckets will simply roll to cold so no harm here.

If you add temporary space so that you have 40TB of cold space on each indexer and rebalance buckets - yes, this should more or less work (remember that rebalancing never results in 100% even distribution of space - if not for any other reason - simply because buckets have different sizes). And after that if you reconfigure your volume limits to 20TB, of course the oldest buckets will get rolled to frozen (again - I assume you want to do those operations on-line which means your cluster will still be ingesting data).
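To illustrate why the uneven distribution matters once the limit drops to 20TB, here is a small Python sketch. The per-indexer usage figures are invented for the example (they only add up to the 160 TB total from the thread), and `excess_mb` is a hypothetical helper, not anything Splunk provides.

```python
# Sketch only: estimate how much cold data sits above a new volume limit.
# Assumption: once usage exceeds maxVolumeDataSizeMB on the cold volume,
# the oldest buckets are rolled to frozen until usage fits again.


def excess_mb(used_mb: int, new_limit_mb: int) -> int:
    """Return the amount of data (in MB) above the new limit, never negative."""
    return max(0, used_mb - new_limit_mb)


NEW_COLD_LIMIT_MB = 20_000_000  # the 20 TB per-indexer target from the thread

# Hypothetical cold usage per indexer after rebalancing (sums to 160 TB).
cold_used_mb = {
    "indexer1": 21_500_000,
    "indexer2": 19_000_000,
    "indexer3": 20_500_000,
    "indexer4": 18_500_000,
    "indexer5": 22_000_000,
    "indexer6": 19_500_000,
    "indexer7": 20_000_000,
    "indexer8": 19_000_000,
}

at_risk = {name: excess_mb(used, NEW_COLD_LIMIT_MB) for name, used in cold_used_mb.items()}
total_at_risk_mb = sum(at_risk.values())

for name, mb in at_risk.items():
    if mb:
        print(f"{name}: about {mb:,} MB above the new limit")
print(f"total above the limit: {total_at_risk_mb:,} MB")
```

In this made-up case the cluster as a whole fits exactly into 8 x 20 TB, yet three indexers are over the limit, so some of the oldest buckets would still be frozen unless a frozen path or extra headroom is in place.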

And here's the tricky part - even if you have shrunk your "logical volume" in Splunk settings, you still have to at some point bring an indexer down and shrink the underlying storage. What I would probably do at this point is enable maintenance mode on the cluster so that it stops bucket fixups, bring the indexer offline, copy out the data, destroy and recreate the filesystem on your Splunk block storage, copy the data back, and bring the indexer online again.

m_zandinia
Path Finder

Thanks @PickleRick 

Yes, I know the tricky part is how to shrink my underlying filesystem, because at the end of the day the space must be freed and made available at the hypervisor level.

PickleRick
SplunkTrust

There is something you're not telling us here 😉

If you have some space which is currently spread across 4 indexers and you want to distribute it across 8 of them, it must be shared. How? NFS? Not a very good idea in the first place.

Anyway, if you can increase your consumption temporarily, @woodcock 's solution is the way to go. If not, if you want minimal amount of data rolled to frozen in the process, lower the volumes by 20% each, deploy a new indexer, rebalance buckets, shrink by another 17%, deploy another indexer, rebalance and so on.
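The 20% and 17% figures follow from keeping the total space constant while going from n to n+1 indexers: each indexer keeps n/(n+1) of its current limit. A minimal Python sketch of the full schedule from 4 to 8 indexers, starting from the 40,000,000 MB cold limit in the posted indexes.conf (the function name `shrink_schedule` is made up for this example):

```python
from fractions import Fraction


def shrink_schedule(start: int, end: int, start_limit_mb: int):
    """Yield (indexer count after the step, percent cut, new per-indexer limit in MB).

    Each step spreads the same total space over one more indexer,
    so the per-indexer limit is multiplied by n / (n + 1).
    """
    limit = Fraction(start_limit_mb)
    for n in range(start, end):
        keep = Fraction(n, n + 1)
        limit *= keep
        # The percent cut is relative to the limit just before this step.
        yield n + 1, float((1 - keep) * 100), int(limit)


rows = list(shrink_schedule(4, 8, 40_000_000))
for count, cut_pct, limit_mb in rows:
    print(f"{count} indexers: cut {cut_pct:.1f}% -> {limit_mb:,} MB each")
```

Each cut is relative to the limit at that step, not to the original 40 TB, which is why the percentages get smaller as the cluster grows and the last step lands on 20 TB per indexer.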

woodcock
Esteemed Legend

This is cake, assuming that you are using index clustering with volume settings.

Deploy the 4 new indexers with the same existing settings.

Let the Cluster Master get everything stable. Because you are doubling that space, albeit temporarily, you should not see errors regarding space/write problems. Even if you do, the cluster master should appropriately react.

Do a data/bucket re/balance from the GUI of the Cluster Master (might take a while).

After that settles, reduce the hot/summary volume settings by 50% and push out the settings.

nickhills
Ultra Champion

The important part (and missing from your post) is how much of that allocated space is actually "in-use"?

If you are using all 8TB of hot and 160TB of cold, then there is no way around it - you need more disk.

However, if your true usage of space falls below this, then you can set your limits to match the space that is available.

For example, if you're only using 4TB of your 8TB hot device, you can safely amend maxVolumeDataSizeMB to 1TB on each host.
You should also understand that setting maxVol.. to a value smaller than what you currently have in use will only cause that data to roll to cold, not be removed.

The limits only tell Splunk what to do at those limits, so as long as you have a small amount of cold space, you can add your new indexers, leaving the cold maxVol... set to 40TB each (320 total).

Let the cluster replicate and balance, then when that process is complete you can reduce it to 20TB each.
But ideally your retention should be dictated by dates rather than volume. I tend to think of the maxVol.. setting as a safety net in case you are getting close to the physical limits of the hardware, but what really drives my storage is how long I need to keep that data for.
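As a sketch of what date-driven retention could look like, the per-index `frozenTimePeriodInSecs` setting in indexes.conf takes a retention period in seconds. The Python below just does the conversion and prints a stanza; the index name `my_index` and the 90-day figure are placeholders, not values from this thread.

```python
# Sketch only: turn a retention period in days into an indexes.conf stanza.
# Assumptions: "my_index" and 90 days are placeholders for illustration.
SECONDS_PER_DAY = 86_400


def frozen_time_period_secs(retention_days: int) -> int:
    """Convert a retention period in days to seconds."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return retention_days * SECONDS_PER_DAY


def render_stanza(index_name: str, retention_days: int) -> str:
    """Render a minimal stanza that only sets time-based retention."""
    secs = frozen_time_period_secs(retention_days)
    return f"[{index_name}]\nfrozenTimePeriodInSecs = {secs}\n"


stanza = render_stanza("my_index", 90)
print(stanza)
```

The volume limit then only kicks in if the data for that period grows beyond what the disk can hold.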

 

If my comment helps, please give it a thumbs up!