I found a doc on the Cisco website which specifies an architecture with about 14 UCS servers to index up to 2 TB/day: 8 indexers, plus 3 search head servers, plus 3 other servers (master node, deployment server, and archival node). I priced out the total Cisco reference config at almost $500,000. I've seen other docs which suggest that I need 8 indexers and 1 search server per 150 GB/day. That's going to be 64 + 8 = 72 servers (plus all the other config hardware). Which is correct?
As JTacy says, please reach out to the Splunk account rep you are working with. We will gladly sit down and discuss architecture and growth. There are a lot of factors involved in sizing deployments of your size, so it's better to brainstorm together. HA and DR are also different goals, with different methods to accomplish each using Splunk and other hardware solutions...
Will do - but still in the information gathering stage.
Some of the topics are advanced but there's interesting insight into running Splunk in the real world in the 2016 conference presentations:
https://conf.splunk.com/sessions/2016-sessions.html
The Cisco recommendations sound appropriate. 8 indexers per 150 GB/day is way off. This Splunk summary may also help:
http://docs.splunk.com/Documentation/Splunk/6.5.1/Capacity/Summaryofperformancerecommendations
That said, there are a lot of variables including CPU speed, disk performance, the size and type of events being ingested, what kind of processing you're doing on the indexer side, how many searches are running, etc. Also note in the Splunk doc that the recommended number of required indexers goes up as the search load goes up; in addition to constantly handling incoming data, indexers do a lot of work to support search. Either way, Cisco's recommendations are a lot like Splunk's and seem like a reasonable starting point if you're going the UCS route.
I've been hunting around, and see several references to using one indexer per 100-250 GB/day (holding the server configuration constant). Regardless, I would trust the Cisco doc as the reference platform design (so 14 servers for up to 2 TB/day), but the Cisco design still means per-indexer performance is about 7,500 EPS. Moreover, I have verified that the Cisco reference doc does NOT specify an HA config, so that $500,000 hardware config for my 1.5 TB/day need just went up by at least 2x. There has got to be a more robust and cost-effective solution.
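For anyone checking the per-indexer numbers above, here is a rough Python sketch of the arithmetic. The 400-byte average event size is an assumption (the thread never states one), as is using decimal units (1 TB = 1000 GB); the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400

def per_indexer_load(daily_gb, indexers, avg_event_bytes):
    """Return (GB/day per indexer, events/sec per indexer).

    avg_event_bytes is an assumed average raw event size; real data
    varies widely, so treat the EPS result as a ballpark only.
    """
    gb_each = daily_gb / indexers
    events_per_day = gb_each * 1e9 / avg_event_bytes
    return gb_each, events_per_day / SECONDS_PER_DAY

# Cisco reference design from the thread: up to 2 TB/day across 8 indexers.
# 400 bytes/event is an assumption, not a figure from the Cisco doc.
gb_each, eps_each = per_indexer_load(daily_gb=2000, indexers=8, avg_event_bytes=400)
print(f"{gb_each:.0f} GB/day and roughly {eps_each:,.0f} EPS per indexer")
```

With those assumptions the design lands at 250 GB/day per indexer, the top of the 100-250 GB/day range, and an EPS figure in the same neighborhood as the ~7,500 quoted. Smaller events push the EPS number up for the same daily volume.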
A single indexer can pull in a lot of data; in testing with Cisco ASA logs I've been able to index over 75K EPS on a single, modest indexer (8 core VM on 2 GHz host CPU). However, that VM wouldn't be able to do much else in terms of running scheduled searches, ad-hoc searches, etc. These logs are easy to index, too; indexing JSON events with 100 indexed fields is probably going to be a lot slower. I believe the scaling recommendations are based on real-world, general-purpose use that will result in a reasonable user experience in most environments, not just to meet an arbitrary incoming data target.
HA in Splunk generally doesn't mean doubling the number of servers and I'm not certain you would need to add a single indexer to the Cisco specs. The indexers stream copies of data to other indexers as it comes in and while it costs a lot of disk, I wouldn't expect it to dramatically affect your ability to index at speed. You configure the cluster to handle an outage of up to n indexers at a time so you end up with duplicate data that's used in the event of indexer failure. You get to decide exactly how much redundancy you want with Splunk. One nice attribute of Splunk HA is that even if you choose to add indexers strictly to meet a redundancy target (again, not necessarily required), those indexers will always be active so your users can enjoy better performance at all times and you don't have hardware sitting idle waiting for something to break.
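To make the "costs disk, not servers" point concrete, here is a minimal Python sketch. It only encodes the idea that surviving an outage of n indexers requires n + 1 copies of the data. The 90-day retention and the 0.5 on-disk ratio are placeholder assumptions, and the sketch treats every copy as a full copy, so read it as an upper-bound estimate and not a Splunk sizing formula.

```python
def clustered_disk_gb(daily_gb, retention_days, tolerated_failures, on_disk_ratio=0.5):
    """Estimate total cluster disk when keeping enough copies to survive
    `tolerated_failures` simultaneous indexer outages.

    on_disk_ratio (disk used per GB of raw data ingested) is an assumed
    placeholder; measure it on your own data.
    """
    copies = tolerated_failures + 1  # one original plus one copy per tolerated failure
    return daily_gb * on_disk_ratio * retention_days * copies

# Hypothetical inputs: 1.5 TB/day, 90 days of retention, survive 1 indexer outage.
total_gb = clustered_disk_gb(daily_gb=1500, retention_days=90, tolerated_failures=1)
indexers = 8  # same indexer count as the Cisco design; only the disk grows
print(f"{total_gb:,.0f} GB total, {total_gb / indexers:,.0f} GB per indexer")
```

Note that the indexer count never changes in this estimate; redundancy shows up as storage per indexer, which is the distinction being drawn above.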
At your scale I would definitely consider sharing more detailed requirements with Splunk so they can help determine whether the general recommendations apply to you. I imagine Professional Services would also be part of the contract with a license this large. If you don't have any Splunk in your environment already, I would at least prototype your workload on VMs or whatever hardware you currently have to get an idea of how things scale; it's an interesting system!