Deployment Architecture

Splunk cluster hardware planning

jpillai
Path Finder

We are planning to upgrade our Splunk hardware. We currently have the setup below (a multisite indexer cluster with independent search head clusters) and we are facing problems with low CPU count and high disk latency (we currently have HDDs). We primarily index data through HEC.

 

Type                            Site  Nodes  CPU p/v (per node)  Memory GB (per node)
SH cluster                      1     4      16/32               128
Indexer cluster                 1     11     4/8                 64
Indexer manager/License master  1     1      16/32               128
SH cluster                      2     4      16/32               128
Indexer cluster                 2     11     4/8                 64
Indexer manager/License master  2     1      16/32               128

 

Daily indexing/license usage is 400-450 GB, which may grow further in the near future.
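To double-check that number against what the license master actually records, a standard search over license_usage.log along these lines should reproduce the daily totals (the 30-day window is just an example):

    index=_internal source=*license_usage.log type=Usage earliest=-30d@d
    | eval GB = b/1024/1024/1024
    | timechart span=1d sum(GB) AS daily_indexed_GB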

Search concurrency example for one instance from the 4-node SH cluster:

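For reference, similar concurrency figures can be pulled from the search head's own metrics.log with something like the sketch below; it assumes the group=search_concurrency events (field names may vary slightly by Splunk version), and <sh_host> is a placeholder:

    index=_internal source=*metrics.log* group=search_concurrency host=<sh_host>
    | timechart span=10m max(active_hist_searches) AS peak_historical max(active_realtime_searches) AS peak_realtime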

 

We are trying to come up with the best hardware configuration that can support such a load.

 

Looking at Splunk's recommended settings, we have come up with the config below. Can someone shed more light on whether this is an optimal config, and also advise on the number of SH machines and indexer machines needed with such new hardware?

Site 1: 3-node SH cluster, 7-node indexer cluster

Site 2: As we use site 2 for searching and indexing only when site 1 is unavailable, maybe it can be smaller?

Role         CPU (p/v)  Memory
Indexer      24/48      64 GB
Non-indexer  32/64      64 GB

gcusello
SplunkTrust

Hi @jpillai ,

two main things:

4/8 CPUs are very few for indexers; they should have at least 12 CPUs each (if you don't have ES or ITSI).

You should analyze your requirements, with special attention to expected input growth and to the number of scheduled searches and concurrent users. As a rule of thumb, one IDX is used for roughly every 200 GB/day indexed (less if you have ES or ITSI), so you have too many IDXs.
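As a rough illustration of that rule of thumb (the 200 GB/day-per-indexer figure and the 30% growth headroom below are assumptions for the example, not measurements from this environment):

    | makeresults
    | eval daily_gb=450, gb_per_indexer=200, growth_headroom=1.3
    | eval indexers_per_site=ceiling((daily_gb*growth_headroom)/gb_per_indexer)

which comes out at about 3 indexers per site before accounting for search load, ES/ITSI, or replication overhead.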

In addition, you should analyze the performance of your disks (storage and system disks) to find the correct number of IDXs, because you need at least 800 IOPS, better if more!
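If introspection data is being collected (it usually is by default), the IOStats component gives a rough view of what the current HDDs are actually delivering under load; it is not a substitute for a synthetic benchmark against the 800 IOPS target, and the data.* field names below are the ones used by the Monitoring Console and may differ by version:

    index=_introspection component=IOStats host=<indexer_host>
    | eval total_iops = 'data.reads_ps' + 'data.writes_ps'
    | timechart span=10m max(total_iops) AS peak_observed_iops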

About configurations, SHs usually require more CPUs than IDXs, so I'd use (if you don't have ES or ITSI):

  • SH and IDX: 24/48 CPUs 64 GB RAM,
  • HF, CM, SHC-D, MC and DS: 12/24 CPUs 64 GB RAM. 

About the secondary site, as @dural_yyz also said, during normal activity it is mainly used for data replication, but you should also analyze the worst case, so I'd use the same configuration as the main site.

Also, the Cluster Manager doesn't need to be so performant, and there must be only one in the cluster.

In other words, you can have only one CM because the cluster continues to run even if the CM is down, possibly keeping a silent (cold standby) copy to turn on if the primary site outage lasts longer than predicted.

Finally, I don't see in your infrastructure the SHC-Deployer, Monitoring Console, or Deployment Server, to which you can apply the same considerations as for the Cluster Manager.

Ciao.

Giuseppe


dural_yyz
Builder

Given the way hot/warm/cold buckets and bucket replication work, it is in your best interest to make the site 1 and site 2 indexing tiers identical.  Someone with advanced on-prem admin experience would be able to size this, but storage becomes your biggest concern with unaligned resources.

If you have some sort of business or budget constraint then I get why you would have unaligned sites - however, personally I would very strongly suggest that both sites have identical compute and storage capacity at the indexing tier.

Your individual indexer CPU count will determine how many concurrent searches can be run.  The compute power of your new machines appears acceptable from the minimal information available.  Keep an eye on skipped searches to confirm - the internal logs will indicate a skip reason (see the search below).  Ideally SH and IDX should have similar if not the exact same CPU core count.
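One common way to watch for that from the internal logs, assuming the standard scheduler.log fields, is to run something like this over your busiest week:

    index=_internal sourcetype=scheduler status=skipped
    | stats count BY reason, host
    | sort - count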


jpillai
Path Finder

Yeah, budget is a concern. Given that the secondary site will only be used during a site 1 failure, most of the hardware will just be sitting there without much activity, except maybe for the indexers doing some replication. So I am trying to see how we can minimize the hardware at site 2. We will probably be using site 2 for indexing and searching for maybe a few hours over a period of months, when site 1 is down or under maintenance.
