So, I'm slightly confused. I'm looking at the Splunk documentation and it recommends sending only 50 GB/day to an indexer and scaling horizontally. But as I understand it, indexing is not CPU or RAM intensive. With that said, if you have a predictable indexing volume that won't change dramatically, wouldn't it make more sense to build a few very high IO boxes instead of many smaller ones? I mean, a bunch of disk is cheaper per gained IO than entire servers.
The problem I'm having is I think a RAID 10 spinning disk backend & a RAID 1 SSD front end would be better than a bunch of cheap servers. Less rack space, cooling, & power, with higher IO than the equivalently priced "add more servers" approach. I'm even thinking of throwing away the hardware RAID card, as it is cheaper to use an 8-core CPU with hyperthreading, and that CPU has room to spare given Splunk's CPU usage.
As I see it, storage is about 3 things: volume (events per day), depth (total days of storage), and availability (how many searches a second). For all of these problems I don't see how more cheap servers are better than a single high IO device, as you spend a bunch on things not related to storing or searching bits on disk. The only rationale I can see behind the whole "bunch of cheap hardware" approach is when you don't know how much you're going to be adding or when, so it makes adoption easier to chew. The downside is that after 3 years of aggressive adoption it all needs to be ripped out for better hardware. Is there something I'm not understanding?
It is based on a "biggest bang for the buck" calculation. I don't know how Splunk comes up with the numbers exactly, but they are looking for servers in the "sweet spot" in terms of performance vs. cost. Splunk has a great design for scaling - it scales horizontally with ease. And horizontal scaling is almost linear, which is pretty impressive.
The recommendation that I have seen is 200 GB/day for indexing, if the box is "lightly used" for search. Some apps, such as the Splunk Enterprise Security app, are doing massive data correlation - this will have an impact on how much indexing can be done. Here is a great blog article on Splunk Sizing and Performance.
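To make the difference between the two per-indexer figures mentioned in this thread (50 GB/day and 200 GB/day) concrete, here is a rough back-of-the-envelope sketch. The 500 GB/day total is a made-up example volume, and the function name is hypothetical; it ignores search load, replication, and headroom, all of which would push the real number up.

```python
import math

def indexers_needed(daily_gb: float, gb_per_indexer_per_day: float) -> int:
    """Smallest whole number of indexers that covers the daily indexing volume.

    Deliberately naive: indexing volume only, no allowance for search load,
    replication, or growth.
    """
    if gb_per_indexer_per_day <= 0:
        raise ValueError("per-indexer capacity must be positive")
    return max(1, math.ceil(daily_gb / gb_per_indexer_per_day))

# Hypothetical environment ingesting 500 GB/day.
at_50 = indexers_needed(500, 50)    # guideline quoted in the question
at_200 = indexers_needed(500, 200)  # "lightly used for search" figure above
print(at_50, at_200)
```

The point is only that the guideline you pick changes the box count by a factor of several, so it is worth knowing which assumptions sit behind each number.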
Finally, more boxes means more hardware resilience and faster recovery time if you lose a box. Of course, there are a ton of considerations: are you replicating data, what is your tolerance for failure, what is your physical environment (number of racks, isolation of racks, number of sites, network bandwidth, latency, etc.)?
Of course you can just buy a honking big box (or two) and do everything on it. Any recommendation in a manual has to be generic; it has to apply to most situations most of the time. Many companies implement these recommendations without problems. It's up to you to apply your knowledge of your environment and tune the recommendations to fit.
Indeed, there's a zillion variables. To make things even more complicated, the indexers take a good chunk of the search load - ideally almost all of it. This load can be CPU-bound or IO-bound, depending on the types of searches being run. For example, a dense search with basic reporting (say stats count by field) run from one search head onto a few indexers might load up two cores per indexer, a low amount of IO per indexer, and virtually nothing on the search head. More indexers speed this up dramatically; a mahoosively large/fast single indexer would be slower than these cheaper indexers. On most systems I've seen, the indexers are busier with searching than with indexing. Don't underestimate this, unless your Splunk is intended as a write-only sink for legal/auditing/forensic purposes only (insert sad splunkface here).
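As a toy illustration of why a search like stats count by field spreads so well across indexers: each indexer can count its own slice of the data, and the search head only has to add up small partial results. The sketch below is plain Python, not Splunk's actual implementation; the function names and the sample events are invented for the example.

```python
from collections import Counter

def partial_count(events, field):
    """Work done on one (hypothetical) indexer: count values of `field` locally."""
    return Counter(e[field] for e in events if field in e)

def merge_counts(partials):
    """Work done on the search head: add up the small per-indexer results."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

# Invented sample data: three indexers, each holding its own slice of events.
indexer_slices = [
    [{"status": "200"}, {"status": "404"}],
    [{"status": "200"}, {"status": "200"}],
    [{"status": "500"}, {"status": "404"}],
]

partials = [partial_count(s, "status") for s in indexer_slices]
result = merge_counts(partials)
print(dict(result))
```

The expensive part (reading and counting events) happens in parallel on the indexers, while the merge step handles only a handful of rows, which matches the "virtually nothing on the search head" observation above.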
For other search types and their impact on the machines see this: http://docs.splunk.com/Documentation/Splunk/6.2.0/Capacity/HowsearchtypesaffectSplunkEnterpriseperfo...
That entire docs book is a good read, have a look at it in its entirety if you haven't already.
If you need more specific help with sizing your environment there's loads of ways your local partners or Splunk SEs can help.
Agree with above, there are many variables.
Very simply put, from my experience, if search performance is at all important, go for as many indexers as you can wrangle the budget for (yeah, helpful statement... I know). If I had my time again, I would have spread our hardware budget across more, lower spec'd servers (fewer CPU cores, less memory, and less disk capacity, but still the highest IOPS possible). From my understanding search is not multithreaded, which is what I see as the issue with a single massive server.
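A crude way to put numbers on that last point, assuming (as stated above, and not verified here) that a single search keeps roughly one core busy per indexer. The event count and per-core scan rate below are invented purely for illustration, and the model ignores IO limits, merge cost, and concurrent searches.

```python
def scan_seconds(total_events: int, indexers: int, events_per_sec_per_core: float) -> float:
    """Toy wall-clock estimate for one search, one busy core per indexer.

    Assumes events are spread evenly, so every indexer scans an equal share
    at the same time.
    """
    if indexers < 1 or events_per_sec_per_core <= 0:
        raise ValueError("need at least one indexer and a positive scan rate")
    return total_events / (indexers * events_per_sec_per_core)

# Invented numbers: 8 million events, 50,000 events/s scanned per core.
one_big_box = scan_seconds(8_000_000, 1, 50_000)
eight_small = scan_seconds(8_000_000, 8, 50_000)
print(one_big_box, eight_small)
```

Under that assumption, extra cores on a single big box do nothing for one search, while every extra indexer shortens it. Extra cores still help when many searches run at once.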
upvote for the evocative phrase "...insert sad splunkface..."
imagining an indexer moaning, "I have all this data, and the only people I can tell it to are LAWYERS..."
Vertical scaling normally refers to adding more power to a single server (i.e. CPU, RAM, or better storage performance), whereas horizontal scaling means adding more servers to the solution.
Just read the nice blog article by Patrick Ogdin on Splunk Sizing and Performance here, and I have to say that somewhere on any page that talks about the "current" recommendations should be the date of the darn article. Yes, eventually someone might figure out it's embedded in the URL, or a resourceful person might click through to the author's page and see the date of the article on the link there, but it really ought to just be visible in plain text.