Deployment Architecture

Does Splunk have recommended storage and archiving policies?

vanderaj1
Path Finder

Hi Splunkers!

I recently inherited responsibility for a Splunk Indexer and I'm not so sure that the disks, data backups, and archiving are set up according to Splunk's recommended practices.

The indexer is directly attached to 7200 RPM near-line SAS disks (semi-expensive, 2TB, slow disks). I see 4 mount points corresponding to the attached storage: /data01, /data02, /data03, /data04. $SPLUNK_HOME lives on /data01.

The current backup strategy involves running a script on a weekly basis. This script basically takes a snapshot of everything in $SPLUNK_HOME into a .tar.gz file and then moves that .tar.gz file to /data02, /data03, or /data04 (whichever one happens to have the available space). I really would like to do away with this strategy, and set up cold & frozen buckets on /data02 and /data03 (and perhaps /data04) instead. That seems more in line with everything I've heard from the Splunk articles I've read. Can anyone confirm this?

Also, I'm not so sure that $SPLUNK_HOME with its hot & warm buckets should reside on slow, external DASD storage like this. Can anyone confirm this too?

Thank you!


vanderaj1
Path Finder

Hey All,

Thanks for taking time out to reply back with feedback on this. I appreciate it! 🙂

Knowing that suggesting a possible answer often depends on more specific details, here's the scoop on our setup:

Approximate daily data ingest rate: 7 GB
Data retention requirements: "online", searchable data - 90 days, "offline", non-searchable frozen data - 1 year

CPUs: 8 cores, utilization typically stays below 50% (when watching top for a bit, it hovered around 20-25%)
RAM: 16 GB
Disks:
• Storage is a RAID5 grouping of seven physical 2 TB, 7200 RPM, SAS near-line disks
• Available storage after RAID is 10.913 TB
• This storage is partitioned into 4 logical volumes, each of size: 2,793.774 GB
• We see these four volumes as /data01 through /data04
• There is no difference in size, speed, or capability between these four volumes
• All four volumes continuously compete for the attention of the same seven physical disks
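As a rough cross-check of the capacity figures above, here is a minimal sketch of the RAID5 arithmetic. It assumes the "2 TB" disks are decimal terabytes and that the reported sizes are in binary units (TiB/GiB); the small gap between the computed and reported numbers would then be controller or filesystem overhead, which this sketch does not model.

```python
# Sketch: usable capacity of a RAID5 set (one disk's worth of space goes to parity).
TB = 10**12   # decimal terabyte, as disk vendors count (assumption)
TIB = 2**40   # binary tebibyte, as most OS tools report
GIB = 2**30   # binary gibibyte


def raid5_usable_bytes(disk_count: int, disk_bytes: int) -> int:
    """Return usable bytes for a RAID5 group of identical disks."""
    return (disk_count - 1) * disk_bytes


usable = raid5_usable_bytes(7, 2 * TB)
usable_tib = usable / TIB
per_volume_gib = usable / 4 / GIB  # four equal logical volumes

print(f"usable: about {usable_tib:.2f} TiB")
print(f"per volume: about {per_volume_gib:.0f} GiB")
```

Both results land within a fraction of a percent of the 10.913 TB and 2,793.774 GB figures listed above.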

We only need backups in order to meet our data retention needs or to reconstitute the indexer in the event of an outage. We could move backups elsewhere if necessary.

Should we continue to keep the hot & warm buckets on /data01 and place the cold bucket on /data02, frozen on /data03?

Knowing that our disks aren't particularly fast, do we need to consider faster disks for our hot and warm buckets?

Thanks!


Richfez
SplunkTrust

Oh, almost forgot!

Play a little with dragging the sliders around at this splunk sizing thingamabob. It's totally wrong, but also it's completely right. And sometimes it's even close. But in any case, it's something. 🙂


Richfez
SplunkTrust

This is just my opinion, which is wrong and perhaps even absurd. Many other folks on here have a lot more experience in this area, though on the flip side it is often experience with 100 GB/day - a whole different playing field than <10 GB/day.

At 7 GB per day, you can keep a year's worth of data searchable for only a bit more than a TB of space... I can put a few GB/day into a VM on my laptop without stressing it too much. 🙂
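To make that estimate concrete, here is a minimal back-of-the-envelope sketch. The 0.5 disk-to-raw ratio is an assumption (a commonly quoted rule of thumb for compressed raw data plus index files); the real ratio depends heavily on the data, so measure it on your own indexes before relying on it.

```python
# Sketch: rough on-disk size for searchable data.
# disk_ratio=0.5 is an assumed rule of thumb, not a measured value.
def estimated_disk_gb(daily_ingest_gb: float, retention_days: int,
                      disk_ratio: float = 0.5) -> float:
    """Estimate on-disk GB for the given ingest rate and retention."""
    return daily_ingest_gb * retention_days * disk_ratio


DAILY_INGEST_GB = 7  # from the original poster's details

for label, days in [("90 days", 90), ("1 year", 365), ("2 years", 730)]:
    print(f"{label}: ~{estimated_disk_gb(DAILY_INGEST_GB, days):,.0f} GB")
```

Under that assumption, one year comes to roughly 1.3 TB and two years to roughly 2.6 TB, which is why a single ~2.7 TB volume is plausible for everything searchable.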

While rebuilding to RAID 10 would be ideal (see note 1), you can probably get by with R5. It won't be super fast but it should be adequate (and honestly, R10 will likely only be slightly better; really, you just aren't doing a lot of writes at 7 GB/day). That R5 setup should give you ~50-100 write IOPS and 200+ read IOPS, which is OK for your current size. Where R10 would shine is if you expand in the future to 10 or 20 GB/day. You aren't likely to be happy at 50 GB/day with your hardware. In fact I'll say I'm pretty sure of that. But with R10 you have a LITTLE more wiggle room with respect to future expansion. You'll start bumping into RAM and CPU limitations before you get more than, oh, double or triple what you are doing, but under that you will probably be OK. (For variously loose definitions of "ok")

I was going to suggest getting rid of those extra volumes and just make it one big pile of data. The reasoning was going to be that you'll a) have a lot less management to do and b) Splunk will have little "work" to do to shift buckets between locations (no actual file copies!) and that will help performance some more. Plus, if you are using those as backup spots for longer retention, well, just use them for longer retention instead of making backups and keeping those.

But at 7 GB/day, the 2.7 TB of any one of those partitions will cover you for 2 years of hot/warm AND cold - all searchable. I'd just leave it as it is (knowing that you can likely whack any one of those partitions and expand your data01 partition into the space if you actually needed more room). Sure, you can freeze that stuff off to another disk, that's fine.

Lastly - DON'T PROMISE MANAGEMENT it'll be fine. Be clear that it's a rather minimal configuration and will probably work for the time being, but that these things grow! As it grows you'll bang into the limits of this particular system. You can keep your eye on the DMC and watch how your system responds as stuff gets added, people start using it and so on, and with that information you can make much better recommendations for next budget cycle what ought to get put into place as a replacement. Or you'll find it works just fine for your needs so why spend money you don't need to? Or maybe they'll spring for the second CPU and doubling RAM. In any case, you'll have real data to go on instead of a random internet-person's ravings, as lucid (and handsome!) as he may be.

Note 1, if you want to rebuild the whole system - use 6 of the disks in R10, leave one as a local backup of at least configs, partition the RAID set as one big disk inside your OS - or a ~50 GB system partition and everything else as a big hunk o' data. That will give you 5 TB or so of usable space fast enough to not be insufferable most of the time depending on the searches folks want to run. (Again, assuming you stay under 10 or 20 GB/day). This will give you perhaps 150-200 write IOPS and 200 read, which is reasonable for your <15 GB/day.
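For anyone who wants to play with the R5 versus R10 comparison, here is a minimal sketch using the textbook write-penalty model (4 back-end I/Os per write for RAID5, 2 for RAID10). The 75 IOPS per 7200 RPM disk is an assumption, and the model ignores controller cache, stripe alignment, and mixed read/write load, so its absolute numbers come out more optimistic than the figures quoted in this post. The useful part is the ratio between the two layouts.

```python
# Sketch: idealized RAID IOPS using the textbook write-penalty model.
# PER_DISK_IOPS is an assumed figure for a 7200 RPM near-line disk.
PER_DISK_IOPS = 75


def raid_iops(disks: int, per_disk_iops: float, write_penalty: int) -> dict:
    """Return idealized pure-read and pure-write IOPS for a RAID group."""
    raw = disks * per_disk_iops
    return {"read": raw, "write": raw / write_penalty}


raid5 = raid_iops(disks=7, per_disk_iops=PER_DISK_IOPS, write_penalty=4)
raid10 = raid_iops(disks=6, per_disk_iops=PER_DISK_IOPS, write_penalty=2)

print("RAID5  (7 disks):", raid5)
print("RAID10 (6 disks):", raid10)
print("write ratio R10/R5:", round(raid10["write"] / raid5["write"], 2))
```

Even with one fewer disk, the R10 layout comes out well ahead on writes in this model, which matches the direction (if not the exact values) of the estimates above.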

Richfez
SplunkTrust

As a start, and not knowing anything but what you've said:

IF you can move backups to somewhere else completely, and
IF each mount point is a pair of 2TB NL disks in a mirror pair (8 disks?), and
IF your ingest rate is above about 10 GB/day, and
IF you have 64 GB of RAM in it (or MORE if it's a Search Head as well), and
IF you have at least 10 or 12 reasonably fast cores, and
IF your search load is reasonably low (only a few "normal" tightly bounded searches usually),

THEN, possibly, what I'd do is rebuild your disks to have them all in a RAID10 setup to optimize your overall IOPS. In that case, leave your hot, warm and cold all in the same place because you want less activity in migrating files between them (on the same file system it's just an inode change, on different file systems it's an actual file copy.) Leave frozen at its default so they just get deleted (frozen == deleted, generally, or archived elsewhere if not).
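A minimal sketch of what "hot, warm and cold all in the same place" could look like as an indexes.conf stanza, rendered from Python so the retention arithmetic is explicit. The index name and directory paths are hypothetical placeholders. `homePath`, `coldPath`, `thawedPath`, `frozenTimePeriodInSecs`, and `coldToFrozenDir` are standard indexes.conf settings, but check the indexes.conf documentation for your Splunk version before applying anything like this.

```python
# Sketch: render an indexes.conf stanza that keeps hot/warm and cold buckets
# on one filesystem. Index name and paths are hypothetical placeholders.
from typing import Optional

SECONDS_PER_DAY = 86_400


def render_index_stanza(name: str, base_dir: str, searchable_days: int,
                        frozen_dir: Optional[str] = None) -> str:
    """Build an indexes.conf stanza as text.

    If frozen_dir is None, coldToFrozenDir is omitted and frozen buckets
    are deleted (the default behavior described above).
    """
    lines = [
        f"[{name}]",
        f"homePath = {base_dir}/{name}/db",          # hot and warm buckets
        f"coldPath = {base_dir}/{name}/colddb",      # same filesystem as homePath
        f"thawedPath = {base_dir}/{name}/thaweddb",
        f"frozenTimePeriodInSecs = {searchable_days * SECONDS_PER_DAY}",
    ]
    if frozen_dir is not None:
        lines.append(f"coldToFrozenDir = {frozen_dir}/{name}")
    return "\n".join(lines)


print(render_index_stanza("example_index", "/data01/splunk_indexes", 90,
                          frozen_dir="/data03/frozen"))
```

With 90 searchable days the stanza gets `frozenTimePeriodInSecs = 7776000`. Note that Splunk does not expire archives written to `coldToFrozenDir`, so a one-year limit on frozen data would need a separate cleanup job.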

8 NL SATA disks like this will do around 300 IOPS, which is below the reference hardware but should be sufficient for smallish loads of a few to a few tens of GB/day.

If you are under 5 or 10 GB/day, you could do about anything you want and as long as your search load isn't great you'll probably be fine. In that case, build for enough redundancy - so if it's mirror pairs underneath your mounts, you may be fine like you are. If it's single disks (4 total) and you can move backups elsewhere, rebuild to R10 across all disks.

If you need data backups, and if you can't move them elsewhere... perhaps you could leave the last mirror pair as a backup set and use the 6 others as "everything but backups" in RAID10.

If there's only 4 disks - that's bad. That means you aren't even running mirror pairs, so losing a single disk actually kills things. In that case please R10 all 4; there's no other solution I'd suggest. If you require data backups for some reason, but can't move them to a real backup system and you only have 4 disks, I don't know what to tell you. Mirror pairs and prayer? Get a budget?

If you actually have a lot more than 8 disks (e.g. each mount point is 4+ disks), then you are doing a lot better. With 12 disks, I'd use 4 in R5 for backups (if you can't move backups elsewhere) then the rest in R10 for data/stuff. Or fiddle the numbers around a bit. Or maybe use 2 for backups - and ONLY back up the configuration, not data - and all other disks as a big pile of data. Another option with 8+ disks and the ability to move backups elsewhere might be to mirror the system disks and then RAID 10 all the rest as your data. Or mirror the system disks, use one disk as a config backup disk, put all but one of the remaining disks in R10 for data, and configure the last one as a hot spare.

There are a lot of options, but what they primarily come down to are how many spindles/disks do you actually have available, what's your approximate ingestion, what are typical search loads and how beefy is the rest of the system?

Richfez
SplunkTrust

While some guess at an answer can probably be given, it would help accuracy a lot if you could provide some/much/all of the following:

  • What is your approximate daily data ingest rate? (Can find it from the licensing page, just eyeball the 30 day history)
  • How much data is on the box (df will likely be close enough for our purposes)
  • How much retention do you actually need? Mostly 6 months? 2 years?
  • How many CPUs, and what's the current utilization (top for a while will tell you)?
  • How many disks are in this system, and how many disks are under each mount point, in what configuration (RAID level, which you might have to work a bit to find out, e.g. by jumping into the RAID card setup during a reboot)
  • How much RAM?
  • And how many local "backups" do you want to keep of the data itself? Can you put them elsewhere?

pgreer_splunk
Splunk Employee

As @rich7177 stated (in not so many words, but basically), the answer is "it depends". Best practices are guidelines at best; how best to use them to optimize your environment depends on many factors, such as ingest rate, the data retention requirements of the use cases and the business, the infrastructure you have available (and/or can get, aka budget), etc.

Some best practices links on the Splunk Wiki:

https://wiki.splunk.com/Things_I_wish_I_knew_then
http://wiki.splunk.com/Community:More_best_practices_and_processes
