I am making a plan for organic Splunk growth for the next year. The main question is:
How do I calculate the growth factor for CPU, memory and disk for each additional 100 GB/day of indexed data?
- Cluster environment: search head and indexer
- Search Factor = 2
- Replication Factor = 2
- 100 users
INDEXER: Count: 10, vCPU: 24, Mem: 64GB, Disk: 5TB
SEARCH HEAD: Count: 6, vCPU: 16, Mem: 32GB, Disk: 300GB
I guess the question is, where are you coming from? How is this doing now, and how much data are you ingesting now?
Mainly because I see 10 indexers. I can't imagine how, from the indexer side of things, adding another 100 GB/day of indexed data will make any serious difference there. I mean, you can do what, a couple of TB/day already? 100 GB/day is 10 GB/day per indexer, and .. that's nothing. Heck 100 GB/day isn't even a high load for an indexer. But without knowing where they're starting from, I can't tell you where they'll end up (except it's unlikely to be much higher than it is now unless ....
... OH I SEE "v"CPU. Umm. Now I have some far more serious questions.
1) What is the average and peak CPU usage of the virtual indexers, as reported by the OS, and as reported by your hypervisor?
2) How many physical CPUs are in your hosts? How many Cores?
3) How many virtual CPUs are allocated on the typical host?
4) What does your hypervisor say your CPU Ready amount is for these VMs?
5) Is this shared storage?
6) If so, what are the IOPS that all 10 LUNs can sustain at the same time, for data larger than your SAN's cache?
7) And when your nightly backups are happening, what IOPS can all 10 LUNs sustain at the same time?
8) How much Memory does your typical Host have unprovisioned, or are you overprovisioned?
In a virtual environment, all this becomes far more of a concern. Folks have been led to believe that hypervisors are magic, but they're not. The above is the list of things that come up all the time with respect to virtual systems and capacity planning, where things don't make sense. Like adding another indexer actually slows down ingestion because even though you have more CPU, you now just have yet another pathway writing to the same 48 disks in a small SAN, and they can only do so much. Or doubling indexer count takes you from "reasonable performance" to sucky performance, because CPU ready jumps off the scale from too much overcommitment. Or adding 2 vCPU to a moderately well performing 10 vCPU box suddenly makes its performance take a nose dive because you are now breaking a NUMA boundary on your host. Lots of little things that have no real parallel in standalone physical systems.
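To put rough numbers on questions 2 and 3 and on the NUMA example, here is a minimal sketch. The host figures (2 sockets of 16 cores, 80 vCPUs allocated, 10 cores per NUMA node) are made up for illustration and are not from this thread, and the function names are hypothetical. Real hypervisor scheduling is more complicated than this, so treat it as a sanity check only.

```python
def vcpu_overcommit_ratio(allocated_vcpus: int, physical_cores: int) -> float:
    """vCPUs allocated across all VMs on a host, divided by its physical cores."""
    return allocated_vcpus / physical_cores


def exceeds_numa_node(vm_vcpus: int, cores_per_numa_node: int) -> bool:
    """True if a single VM has more vCPUs than one NUMA node has cores.

    Simplification: assumes one NUMA node per socket and ignores hyperthreading.
    """
    return vm_vcpus > cores_per_numa_node


# Hypothetical host: 2 sockets x 16 cores, 80 vCPUs handed out to VMs.
host_cores = 2 * 16
print(vcpu_overcommit_ratio(80, host_cores))  # 2.5 vCPUs per physical core

# Hypothetical host with 10 cores per NUMA node: 10 vCPU fits, 12 vCPU does not.
print(exceeds_numa_node(10, 10))  # False
print(exceeds_numa_node(12, 10))  # True
```

A ratio well above 1.0 is where CPU ready tends to become worth watching, but the threshold that actually hurts depends on the workload and the hypervisor.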
So, we need more information. 🙂
Hello @rich7177 ,
Sorry if I said it wrong. I am gathering the requested information. I'm currently indexing ~700 GB/day and expect to reach 1.5 TB/day by the end of next year.
I want to find out how much I need to grow VM memory and CPU for each increase of 100 GB/day. With this I would be able to measure the capacity of my environment and find out whether the resources are sufficient, taking into consideration 100 to 150 concurrent users.
Aha! That makes entirely more sense.
I'm sure we can come up with at least a few general guidelines or suggestions, but there are so many caveats that I'm not sure they'll be directly useful. Your best bet may be to contact your Splunk rep and ask them to set up a capacity planning meeting with people who can take a direct look at your system and bring to bear what they've seen in the past. I'm positive they'll be happy to help.
For instance, "100-150 concurrent users" can vary from "Machine crushing levels of searches" to "Meh, Splunk running on my laptop could handle them" depending on what exactly they're doing. Well, maybe not 150 on my laptop, but you get the point. The disparity in load can be many, many orders of magnitude, but it depends on exactly what they're doing in your system.
I'd say the starting point is the Monitoring Console (MC). Get to know it well, and learn what information is available in it that tells you how busy your machines are right now. See your ingestion queues, CPU usage by sourcetype, search load ... all that kind of thing.
Then I guess just do a bit of math. Without knowing any more, you could probably pretend that it scales linearly enough that if you have 500 GB/day ingestion and 50 users right now, you'll need 3x the resources to do 1.5 TB/day and 150 users. This of course depends on the new data and users acting somewhat similarly to the existing ones, but ... well, you'll know better if that's unlikely to be true. But if it is, and you are currently running at 25-50% CPU on the indexers, you'll need to expand your indexing tier a bit, because otherwise you'd be running at 75%-150% CPU in the future, which is a bit high.
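As a rough illustration of that math, here is a minimal Python sketch. It assumes CPU load scales linearly with daily ingest, which is only the simplification described above and not a Splunk sizing formula. The function names and the 60% CPU ceiling are assumptions made up for this example; pick a ceiling that matches what you see in your own environment.

```python
import math


def projected_cpu_pct(current_cpu_pct: float, current_gb_day: float,
                      target_gb_day: float) -> float:
    """Scale current CPU usage in proportion to daily ingest (linear assumption)."""
    return current_cpu_pct * (target_gb_day / current_gb_day)


def cpu_pct_per_100gb(current_cpu_pct: float, current_gb_day: float) -> float:
    """CPU percentage points consumed per 100 GB/day of ingest (linear assumption)."""
    return current_cpu_pct / (current_gb_day / 100)


def indexers_needed(current_indexers: int, projected_pct: float,
                    ceiling_pct: float = 60.0) -> int:
    """Indexer count that keeps average CPU at or below ceiling_pct.

    ceiling_pct is an assumed comfort threshold, not an official figure.
    """
    return math.ceil(current_indexers * projected_pct / ceiling_pct)


# The example from the post: 500 GB/day today, 1.5 TB/day target, 10 indexers.
for cpu_now in (25.0, 50.0):
    future = projected_cpu_pct(cpu_now, 500, 1500)
    print(cpu_now, future, cpu_pct_per_100gb(cpu_now, 500),
          indexers_needed(10, future))
```

Under these assumptions, 25% CPU today projects to 75% and 50% projects to 150%, matching the range above. Plugging in your own measured CPU and ingest figures gives a first estimate to check against what the Monitoring Console shows.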
If you are just starting out and are hardly ingesting anything yet, and don't have a large user load yet, then that's really where Splunk can help. They can talk to you about this sort of thing and try to give you an idea based on the data you have or are expecting to put in, and on the exact use cases you are expecting. We probably can't do that well here - it probably takes 2-3 one-hour meetings over 2-3 weeks for them to do this justice.