At this point, I'm just interested in knowing whether Splunk will be able to run. Data onboarding will happen later, and that's when computing power will be scaled up to at least 16 CPUs across all of the Splunk components in use. Only 8 users will be accessing Splunk, with only a few running concurrent searches once data onboarding is complete.
We plan to ingest only 100 GB/day with a retention period of 30 days; older data will be archived to object storage.
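For reference, the retention side of that should just be a couple of per-index settings. A rough sketch, with the index name, paths, and archiving script as placeholders we haven't finalized:

# indexes.conf on the indexers (index name, paths, and script are placeholders)
[firewall]
homePath   = $SPLUNK_DB/firewall/db
coldPath   = $SPLUNK_DB/firewall/colddb
thawedPath = $SPLUNK_DB/firewall/thaweddb
# roll buckets to frozen after 30 days (30 * 86400 seconds)
frozenTimePeriodInSecs = 2592000
# Splunk hands each frozen bucket path to this script, which would copy it to object storage
coldToFrozenScript = "$SPLUNK_HOME/bin/python" "$SPLUNK_HOME/bin/archive_to_object_storage.py"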
Components:
2 ES search heads, 2 ad-hoc search heads (split across 2 sites): 4 CPU each
4 Indexers (clustered across 2 sites): 4 CPU each
2 Cluster Masters (split across 2 sites): 4 CPU each
2 Deployment servers (split across 2 sites): 4 CPU each
2 Heavy Forwarders (split across 2 sites): 4 CPU each
Just an additional comment to the existing valuable replies.
When you install ES on the 4-CPU host, disable all of the data model accelerations until you have data, disable the data model acceleration enforcement, and disable any saved searches you won't need initially.
This will keep ES from complaining too much about slow performance.
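To make that concrete, the settings involved look roughly like this; the data model and correlation search names are just examples (yours will vary by ES/CIM version), and the same changes can be made from the ES and data model management UIs:

# datamodels.conf (local override on the ES search head): turn off acceleration
[Network_Traffic]
acceleration = 0

# savedsearches.conf: disable a correlation search you don't need yet
[Access - Excessive Failed Logins - Rule]
disabled = 1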
I'm going to be the bad guy here and state something obvious: if you're trying to run this in production with 4 CPUs, that is not a supported configuration and Splunk will not support it. With ES you're using a premium app from Splunk, and if you plan to use it as intended, you need to have the resources available for data model acceleration (DMA), etc.
Otherwise, why did you choose to go with ES?
More to the point, Splunk itself is not supported on a 4-CPU system anyway. In a lab or dev environment, sure. But if this is production, you're going to have problems with scalability and usability that will make users think Splunk is a poor solution they've paid a premium for. Stick to the documented minimum hardware spec, at the very least.
The OP did say he was going to increase the cores to 16 once data onboarding started...
Yes, Splunk ES will run on a 4-CPU system (I've done it), but performance will be poor. There's a reason Splunk recommends ES search heads have 16 CPUs and that's because of the load ES places on the system. Even with no data onboarded (what good is ES without data?) ES will still run several searches and each will use a CPU.
The component list is "interesting".
Why 2 ES search heads? Please don't say "HA/DR" or "redundancy". How will you keep them in synch? How will you avoid duplicated searches run by the two SHs?
Will the 2 indexer sites be in the same cluster or different clusters? If different, why? If you're striving for HA, use a single cluster spanning 2 sites.
Why 2 cluster masters? A cluster can have only 1 master.
Why 2 deployment servers? Do you have more than 50,000 forwarders to manage? That's the only reason to have more than one DS.
Having more than one HF is normal. Just be sure each is processing different inputs or you'll end up with duplicate data.
All of the components, except for the DS and HF are very under-sized. They'll work, but you won't be happy with them.
Hi @richgalloway - These are all great questions. I have included my responses below and would be interested to hear your thoughts.
Q: Why 2 ES search heads? Please don't say "HA/DR" or "redundancy". How will you keep them in synch? How will you avoid duplicated searches run by the two SHs?
A: The second search head will reside in the DR site with splunkd stopped; it will only be started in the event of a disaster at the primary/Prod site. Because splunkd is not running, the standby search head is not accessible, so it won't run duplicate searches. We plan to have an rsync job replicate the knowledge objects from primary to DR every 5 minutes.
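For what it's worth, the replication we have in mind is just a cron entry on the primary search head along these lines (the DR hostname and paths are placeholders):

# /etc/cron.d/splunk-dr-sync on the primary ES search head
# every 5 minutes, push app and user knowledge objects to the DR search head
*/5 * * * * splunk rsync -az --delete /opt/splunk/etc/apps/ dr-sh1:/opt/splunk/etc/apps/
*/5 * * * * splunk rsync -az --delete /opt/splunk/etc/users/ dr-sh1:/opt/splunk/etc/users/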
Q: Will the 2 indexer sites be in the same cluster or different clusters? If different, why? If you're striving for HA, use a single cluster spanning 2 sites.
A: A single indexer cluster containing all 4 indexers, spread across the Prod and DR sites.
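On the cluster master we expect that to be a multisite configuration roughly like the following; the site names, factors, and secret are illustrative:

# server.conf on the cluster master
[general]
site = site1

[clustering]
mode = master
multisite = true
available_sites = site1,site2
# 2 copies of each bucket overall, at least 1 at the originating site
site_replication_factor = origin:1,total:2
site_search_factor = origin:1,total:2
pass4SymmKey = <cluster secret>

# server.conf on each indexer (site2 indexers set site = site2)
[general]
site = site1

[clustering]
mode = slave
master_uri = https://cm.example.com:8089
pass4SymmKey = <cluster secret>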
Q: Why 2 cluster masters? A cluster can have only 1 master.
A: The second cluster master will act as a cold standby in case the primary CM is lost in a disaster and cannot be recovered.
Q: Why 2 deployment servers? Do you have more than 50,000 forwarders to manage? That's the only reason to have more than one DS.
A: The second deployment server will act as a cold standby in case the primary DS is lost in a disaster and cannot be recovered.
I suspected one of the sites would be a standby site. I have a few more questions, comments, and suggestions.
What is the recovery time objective for the standby site? IOW, for how long can you be without Splunk?
Are these servers physical or virtual? If virtual, you may be able to use features of the VM provider (VMware, for instance) to handle server recovery for you.
Using rsync to keep Splunk instances in synch is an incomplete solution at best. This is partly why Splunk developed clustering (although clustering is intended as a scaling solution, not a DR one). Anything that uses KVStore (which includes ES) will not be replicated by rsync.
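For the KVStore piece specifically, recent Splunk versions include a CLI backup/restore that could be folded into the same job; a sketch with a placeholder archive name (check the docs for the restore prerequisites on your version):

# on the primary search head: snapshot the KV store
# (the archive is written under $SPLUNK_HOME/var/lib/splunk/kvstorebackup)
$SPLUNK_HOME/bin/splunk backup kvstore -archiveName kvstore_snapshot
# copy the archive to the standby, then on the standby:
$SPLUNK_HOME/bin/splunk restore kvstore -archiveName kvstore_snapshot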
Splunk configurations should be managed in an external configuration management (CM) solution such as git. When it's time to start the standby system(s), load the latest config from CM first.
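A rough sketch of that start-up step, assuming the deployment apps and local system config are tracked in git (repo layout, branch, and paths are assumptions):

# on the standby instance, before starting splunkd
git -C /opt/splunk/etc/deployment-apps pull origin main
git -C /opt/splunk/etc/system/local pull origin main
/opt/splunk/bin/splunk start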
Since you will have an indexer cluster spanning sites, consider having a search cluster that also spans those sites. A cluster will keep itself in synch and spreads the search load across all members of the cluster, making better use of limited resources. Managing a cluster is different from managing a single SH, but (IMO) it's easier to manage a SHC than it is to manage 4 independent SHs.
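If you do go that route later, standing up the SHC is mostly a couple of CLI calls per member; a sketch with placeholder hostnames, ports, and credentials:

# run on each search head that will join the cluster (adjust -mgmt_uri per member)
splunk init shcluster-config -auth admin:changeme \
    -mgmt_uri https://sh1.example.com:8089 \
    -replication_port 9777 \
    -conf_deploy_fetch_url https://deployer.example.com:8089 \
    -shcluster_label es_shc \
    -secret <shared secret>
splunk restart
# then, on one member only, bootstrap the captain
splunk bootstrap shcluster-captain \
    -servers_list "https://sh1.example.com:8089,https://sh2.example.com:8089" \
    -auth admin:changeme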
To make cut-over easier, use DNS to route traffic to the DS, CM, and LM (license master, which was not mentioned previously, but is necessary). When the primary site goes down, DNS will redirect requests to the other site without having to change configurations in indexers and forwarders.
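Concretely, that just means the forwarder- and indexer-side settings point at DNS aliases instead of specific hosts; for example (alias names are placeholders, and the clustering master_uri on the indexers works the same way):

# deploymentclient.conf on forwarders
[target-broker:deploymentServer]
targetUri = ds.example.com:8089

# server.conf on indexers and search heads: point at the license master by alias
[license]
master_uri = https://lm.example.com:8089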
Here are my responses to your questions and comments. This is definitely a very helpful discussion!
Q: What is the recovery time objective for the standby site? IOW, for how long can you be without Splunk?
A: RTO for DR is 1 hour, so Splunk would need to be up and running again within an hour to meet our availability SLA.
Q: Are these servers physical or virtual? If virtual, you may be able to use features of the VM provider (VMware, for instance) to handle server recovery for you.
A: These are all Oracle Cloud VMs running RHEL 7.x. You raise a good point about server recovery, so I'll check with the right folks about that.
Q: Using rsync to keep Splunk instances in synch is an incomplete solution at best. This is partly why Splunk developed clustering (although clustering is intended as a scaling solution, not a DR one). Anything that uses KVStore (which includes ES) will not be replicated by rsync.
A: Search head clustering was recommended, but we're low on available CPU, so we're holding off on SH clustering for now. As we begin ingesting data and authorizing more users to access Splunk, we'll be in a better position to get the necessary resources for a SHC. We only need 1 more ES SH and 1 more ad-hoc SH to form 2 SH clusters.
Q: Splunk configurations should be managed in an external configuration management (CM) solution such as git. When it's time to start the standby system(s), load the latest config from CM first.
A: That makes sense.
Q: Since you will have an indexer cluster spanning sites, consider having a search cluster that also spans those sites. A cluster will keep itself in synch and spreads the search load across all members of the cluster, making better use of limited resources. Managing a cluster is different from managing a single SH, but (IMO) it's easier to manage a SHC than it is to manage 4 independent SHs.
A: That makes sense. Once we have the resources to build out a SHC, it will span both sites.
Q: To make cut-over easier, use DNS to route traffic to the DS, CM, and LM (license master, which was not mentioned previously, but is necessary). When the primary site goes down, DNS will redirect requests to the other site without having to change configurations in indexers and forwarders.
A: That's a great point.