Greetings fellow delvers of the deep data....
We recently made some changes to indexes.conf because we were not sure the config was doing what we wanted it to do.
The result of that poorly considered decision was that 2.6 TB of data moved out of hot/warm into cold, and another 400 GB was probably prematurely aged out into frozen.
Let me preface this by saying for our size, we have a lot of SSD on our new indexers.
Current disk footprint
/dev/mapper/splunkdatacold     8.8T  4.4T  4.5T  49%  /splunkdatacold
/dev/mapper/splunkdatafrozen    15T  2.5T   13T  17%  /splunkdatafrozen
/dev/mapper/vg02-lv01          650G  414G  237G  64%  /splunkdatamodels
/dev/mapper/vg02-lv00           11T  933G  9.1T  10%  /splunkdatahot
The result of that grand data shuffle is that I have an 11 TB SSD volume for hot with 9.1 TB of unused space. I also have 5 TB unallocated in the volume group that I could add.
I'm told that amount of SSD is somewhat uncommon, but we were going for a reasonably future proof configuration. I am about to cluster another indexer with the same space footprint into this, so I want to make sure we are utilizing the SSD effectively.
For the purpose of this question, I will stick to one of our big indexes.
We had these two settings and turning them off is what caused the multiple terabyte shift out of hot.
maxHotIdleSecs = 86400
maxWarmDBCount = 6800
One of our bigger, and fairly typical, index definitions:
[networks]
homePath = volume:hot/networks/db
coldPath = volume:cold/networks/colddb
thawedPath = $SPLUNK_DB/networks/thaweddb
maxTotalDataSizeMB = 5083636
homePath.maxDataSizeMB = 3389260
coldPath.maxDataSizeMB = 1694376
# explicit path to frozen directory
coldToFrozenDir = /splunkdatafrozen/networks
We had thought that a 2/3 to 1/3 ratio between hot+warm and cold would be achieved by specifying maxTotalDataSizeMB, and then setting homePath.maxDataSizeMB and coldPath.maxDataSizeMB so that those two values added up to the maxTotal value.
So the question is, how do you go about engineering a 2/3 to 1/3 split between hot and cold? We want to utilize the SSDs as much as possible, which is why we have been playing with the maxWarmDBCount and maxHotIdleSecs parameters.
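For what it's worth, one way to express the split is at the volume level rather than per index, since Splunk rolls the oldest warm buckets to cold when a volume's cap is reached. This is only a sketch: the volume stanza names match the paths shown above, but the sizes are illustrative, not your real numbers.

```ini
# Cap the SSD volume at roughly 2/3 of the searchable budget;
# when the cap is hit, the oldest warm buckets roll to cold.
[volume:hot]
path = /splunkdatahot
maxVolumeDataSizeMB = 7000000

# Cap the HDD volume at roughly 1/3.
[volume:cold]
path = /splunkdatacold
maxVolumeDataSizeMB = 3500000
```

With volume caps in place, the per-index homePath.maxDataSizeMB and coldPath.maxDataSizeMB values then carve each index's share out of those totals, rather than being the only thing enforcing the split.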
Today we turned both of those back on and changed the networks sizing a bit, just to see what actually happens, because the previous configuration never pushed hot past 30%.
Current parms in place:
maxHotIdleSecs = 86400
maxWarmDBCount = 6800

[networks]
homePath = volume:hot/networks/db
coldPath = volume:cold/networks/colddb
thawedPath = $SPLUNK_DB/networks/thaweddb
maxTotalDataSizeMB = 5083636
# explicit path to frozen directory
coldToFrozenDir = /splunkdatafrozen/networks
homePath.maxDataSizeMB = 5083636
This is not so much an answer, as a question back to you.
The risks of consumer/desktop SSD endurance/wear-out have, I 'believe', all but been eliminated in most desktop use cases with modern SSDs, but I am interested in whether you specifically opted for Single-Level Cell (SLC) NAND flash over Multi-Level Cell (MLC)?
It seems to me that your hot db is going to sustain significant write IOPS, which will (over time) have an impact on the longevity of your storage.
Clearly with 9 TB 🙂 of SSD you must be running quite a few "spindles", which will greatly mitigate the risk, but I am fascinated by any sizing you did to estimate MTBF etc.
If you have this, is this something you would be able to share?
Because our data source owners were unable to tell us how long their retention requirements were, and because our initial onboarding is all network / Windows event log / security / Active Directory / Linux syslog data that falls under the interesting data for the Enterprise Security app, we decided on 90 days searchable and 1 year total retention.
When we went with physical hardware for the indexers, we were having a bit of an argument with our storage group over the 'Yes, I said terabytes' request, and we came into a situation where the first cut at a mix of HDD and SSD sizing predicted the original SSD allocation would be too small for expected license growth. We have two Cisco 3260 M4 chassis with 24 disk slots each, originally populated with 7 x 900 GB SSDs and 9 x 900 GB 15K HDDs (2 as a RAID 1 system drive and 7 as the HDDs for cold). We have a 6-slot SSD RAID 5 array with one active hot spare.
Not really being able to pull the lower-capacity HDDs to free up slots, because we still had not gotten SAN storage for cold and frozen, we then added five 3.8 TB SSDs, which was the minimum number of devices to do a RAID 5 and still have an active hot spare. The available-slot limitation made us go with higher-density SSD devices in the second RAID, and both RAID arrays are presented as physical volumes to LVM in the same logical volume group.
The MTBF characteristics of the Cisco enterprise-class SSD offerings were considered sufficient for our needs, given the RAID 5 and hot spare configuration, over the projected life of the servers. When it comes time to replace these, we will re-architect the indexers based on better data and the ability to use SAN.
We are using SSD for data models and hot/warm.
Key things to note
- no need to store cold data in SSD
- Move out Frozen out of SSD to cheaper storage or delete them
- DM is very important and should be in SSD, same as the hot volume
- Most of the settings are at "index" level
- I would say try it on one of the important indexes and see what the impact is before applying it to each index.
[networks]
homePath = volume:hot/networks/db
coldPath = volume:cold/networks/colddb
thawedPath = volume:cold/networks/thaweddb
# Assuming DM is part of SSD and volume:dm
tstatsHomePath = volume:dm/summarydb/datamodel_summary
# maxTotalDataSizeMB (assuming 100 GB per day * 365 days + 2 days buffer)
maxTotalDataSizeMB = 36700000
maxHotBuckets = 10
# If you are receiving more than 100 GB of data into networks per day, use auto_high_volume which has 10 GB as bucket size
maxDataSize = auto_high_volume
# Assuming 100 GB per day, store 400 buckets in warm, which will be in volume:hot (default is 300)
maxWarmDBCount = 400
# I'm not putting frozen settings here, as they can be anything depending on your slower hard disk
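The numbers in that stanza follow from simple arithmetic. A minimal sketch, using the same assumptions as the comments above (100 GB/day ingest, 365-day retention plus 2 days buffer, ~10 GB auto_high_volume buckets, and a hypothetical choice of 40 days' worth of buckets kept in warm):

```python
# Back-of-the-envelope sizing for an indexes.conf stanza.
# All inputs are assumptions, not values from the original post.
DAILY_INGEST_GB = 100
RETENTION_DAYS = 365
BUFFER_DAYS = 2
BUCKET_SIZE_GB = 10      # auto_high_volume rolls hot buckets at roughly 10 GB
WARM_DAYS_ON_SSD = 40    # how many days' worth of buckets to keep in warm

# Total retention cap, in MB (1 GB treated as 1000 MB for round numbers)
max_total_mb = DAILY_INGEST_GB * 1000 * (RETENTION_DAYS + BUFFER_DAYS)

# Warm bucket count: buckets created per day times days kept on SSD
buckets_per_day = DAILY_INGEST_GB / BUCKET_SIZE_GB
max_warm_db_count = int(buckets_per_day * WARM_DAYS_ON_SSD)

print(f"maxTotalDataSizeMB = {max_total_mb}")   # 36700000
print(f"maxWarmDBCount = {max_warm_db_count}")  # 400
```

Plugging in your own daily volume and how long you want data on SSD gives the values to set per index.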
We are doing all of that, finally, after a protracted discussion with our storage group. Your example really highlights the need to create a thawed path, so that's probably what we will use the vacated HDD space for.
To complement the above:
- check the size in your volume:hot definition -> that's your main size criterion
- go to the Monitoring Console; it will show you, for each index, the criterion that will move your data from one stage to the next
- since volume:hot is the most constraining drive of all, make sure the size on your volume:hot is right (use auto_high_volume for a high-volume index, as it looks from your behavior like you have too many buckets), and put maxWarmDBCount = 6800 (or an appropriate value) back in place, because that won't change instantly
Also, as mentioned above, it probably makes sense to have the tsidx data on SSD (one or two volumes, that's up to you).
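To see where an index's buckets actually sit right now, the dbinspect search command lists every bucket with its state and size. A quick sketch, using the networks index from the stanzas above:

```
| dbinspect index=networks
| stats count AS buckets, sum(sizeOnDiskMB) AS totalMB BY state
```

Running this before and after a config change makes it obvious whether buckets are rolling from hot/warm to cold for the reason you expect.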