So, I'm slightly confused. I'm looking at the Splunk documentation and it references sending only 50 GB/day to an indexer & scaling horizontally. But as I understand it, indexing is not CPU- or RAM-intensive. With that said, if you have a predictable indexing volume that won't change dramatically, wouldn't it make more sense to build a few very high-IO boxes instead of many smaller ones? I mean, a bunch of disks is cheaper per gained IO than entire servers.
The problem I'm having is that I think a RAID 10 spinning-disk backend & a RAID 1 SSD front end would beat a bunch of cheap servers: less rack space, cooling, & power, with higher IO than the equivalently priced "add more servers" approach. I'm even thinking of dropping the hardware RAID card entirely, since software RAID on an 8-core with hyperthreading is cheaper, and that CPU has room to spare given Splunk's modest CPU usage.
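For what it's worth, Splunk's bucket lifecycle seems to map naturally onto that tiering: hot/warm buckets on the SSD front end, cold buckets on the spinning RAID 10. A minimal indexes.conf sketch of what I mean (the mount points are hypothetical, just illustrating the split):

```
# indexes.conf — hot/warm on the SSD tier, cold on the spinning tier
# (paths are made up for illustration)
[main]
homePath   = /ssd/splunk/main/db          # hot/warm buckets (fast, small)
coldPath   = /spinning/splunk/main/colddb # cold buckets (big, cheap IO)
thawedPath = /spinning/splunk/main/thaweddb
```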
As I see it, storage is about 3 things: volume (events per day), depth (total days of retention), and availability (how many concurrent searches). For all of these, I don't see how more cheap servers beat a single high-IO device, since with each extra server you spend a bunch on things not related to storing or searching bits on disk. The only rationale I can see behind the whole "bunch of cheap hardware" approach is when you don't know how much you're going to be adding or when, which makes adoption easier to swallow. The downside is that after 3 years of aggressive adoption it all needs to be ripped out for better hardware. Is there something I'm not understanding?
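To make the cost argument concrete, here's the back-of-envelope math I'm doing. Every number below is hypothetical (server and disk prices, per-disk IOPS, the ~0.5 RAID 10 random-write penalty) — the point is only the shape of the comparison, not the specific figures:

```python
# Back-of-envelope: cost per usable IOPS, scale-up vs scale-out.
# All prices and IOPS figures are hypothetical placeholders.

def fleet_cost_per_iops(n_servers, server_cost, disks_per_server,
                        disk_cost, iops_per_disk, raid_factor=0.5):
    """Total fleet cost divided by usable random-write IOPS.

    raid_factor is the fraction of raw IOPS left after the RAID
    penalty (roughly 0.5 for RAID 10 random writes, since each
    write hits two mirrored disks).
    """
    total_cost = n_servers * (server_cost + disks_per_server * disk_cost)
    usable_iops = n_servers * disks_per_server * iops_per_disk * raid_factor
    return total_cost / usable_iops

# One big box: 24 drives behind one chassis (hypothetical $8k server, $300 disks).
big = fleet_cost_per_iops(1, 8000, 24, 300, iops_per_disk=180)

# Four commodity indexers: 6 of the same drives each ($3k server each).
small = fleet_cost_per_iops(4, 3000, 6, 300, iops_per_disk=180)

print(f"one big box:       ${big:.2f} per IOPS")
print(f"four small boxes:  ${small:.2f} per IOPS")
```

With these made-up numbers both fleets end up with the same usable IOPS, but the scale-out fleet pays for three extra chassis, CPUs, and PSUs to get there — which is exactly the overhead I'm questioning.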