What is the best way to deploy Splunk on very strong machines?

Explorer

Hi, we have 6 physical machines, each with 512 GB RAM, 56 CPUs, and 10 disks of 12 TB each.
We want to use them for a Splunk indexer cluster, but we are not sure what the best way to do this is.
Our problem is that we don't know how well Splunk uses this large amount of memory and CPU, and we also don't want a single logical volume that contains all the disk space.
Knowing that Elasticsearch is recommended to run as multiple instances on a single machine, we have thought about doing the same, but it is troublesome to configure properly.
We also thought of running a few Splunk instances in Docker.
What is the best approach?

Splunk Employee

This question cannot be credibly answered without knowing answers to some key questions:

  1. What is your expected daily data ingest?
  2. What is your data retention policy, i.e. how many days of log data do you need to keep?
  3. What is your expected concurrent search volume?
  4. What kind of HDDs do these boxes have: 5k, 7k, or 10k rpm?
  5. Are you planning on using Premium Apps, like Splunk App for Enterprise Security?

While possible, we typically no longer recommend running multiple Splunk instances on the same physical server, since more recent product versions offer several parallelization settings that make better use of the available hardware. Administering such a setup, especially in a clustered environment, also comes with one or three headaches (if the hardware dies, you lose two peers; each instance needs a different IP or different ports; etc.). In addition, running Splunk Enterprise in Docker is not supported at this point in time, so it is probably not your best choice for production unless you don't expect Splunk support to help you when issues arise.

Your key constraint is likely going to be available disk IOPS (specifically random read). Even if your drives are 10k rpm drives, a RAID10 array with 10 drives would give you around 900 IOPS max (at an assumed 60:40 read/write ratio). That's about what you need for a single Splunk indexer instance. Splunk has a unique and demanding I/O profile, with constant streaming writes (indexing) and completely unpredictable, often random read operations. Any disk subsystem that provides less than 1000 IOPS is not likely to do well with multiple Splunk instances, whether they are containerized or not.
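
For anyone who wants to redo the estimate above with their own drives, here is a rough sketch of the usual back-of-the-envelope RAID formula. The 125 IOPS per drive is an assumed rule-of-thumb figure for a 10k rpm disk, not a measured value, and the function name is just for illustration. The RAID10 write penalty of 2 reflects that every write lands on two mirrored disks.

```python
def effective_iops(drives, iops_per_drive, read_fraction, write_penalty):
    """Estimate usable IOPS of an array for a mixed read/write workload."""
    raw = drives * iops_per_drive
    write_fraction = 1.0 - read_fraction
    # Each logical write costs `write_penalty` physical I/Os; reads cost one.
    return raw / (read_fraction + write_fraction * write_penalty)

# Assumptions: 125 IOPS per 10k rpm drive (rule of thumb), 60:40 read/write.
estimate = effective_iops(drives=10, iops_per_drive=125,
                          read_fraction=0.6, write_penalty=2)
print(round(estimate))  # prints 893
```

With these inputs the result lands close to the 900 IOPS quoted above. Slower drives (5k or 7k rpm) would push the number well below that.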

If you have a high data ingest rate with a relatively low search workload, you may be able to configure two ingestion pipelines and process upwards of 600GB/day per indexer (indexing is largely a streaming write operation). Two indexing pipelines at full capacity will utilize between 8 and 10 cores, so you still have a ton of cores left for search processes (remember, indexers are also search peers, and search is what stresses your I/O most).
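
To put a number on "a ton of cores left", here is a quick sketch using the figures from this thread (56 cores per machine, two pipelines using 8 to 10 cores in total). It assumes each pipeline uses an equal share of those cores, which is a simplification, and the function name is made up for the example.

```python
def cores_left_for_search(total_cores, pipelines, cores_per_pipeline):
    """Cores not consumed by indexing pipelines running at full capacity."""
    return total_cores - pipelines * cores_per_pipeline

TOTAL_CORES = 56  # per machine, from the question
PIPELINES = 2

# 8 to 10 cores for two pipelines works out to 4 to 5 cores per pipeline.
best = cores_left_for_search(TOTAL_CORES, PIPELINES, 4)
worst = cores_left_for_search(TOTAL_CORES, PIPELINES, 5)
print(f"{worst} to {best} cores remain for search")  # prints "46 to 48 cores remain for search"
```

Whether those cores can actually be kept busy still depends on the disks keeping up.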

If you have a relatively low data ingest rate but a very high concurrent search volume, you will be able to utilize cores for searches as long as your disk can keep up. A lot depends on the types of searches as well. Dense searches, which require a lot of buckets to be retrieved from disk and unzipped, tax CPU cores more than sparse searches, which can use index files to filter down the number of buckets we need to open to satisfy the search.

What will help you a lot here is the amount of RAM your system has. The more RAM, the bigger your OS file system cache and the higher the likelihood that searches over recent data can be satisfied out of the cache vs. having to bother the disk.

To summarize: While these servers have a ton of memory and CPU resources, the disk subsystem is likely your limiting factor. As soon as you hit I/O waits, your performance will take a nose dive, no matter how many CPU cores are ready to do work.
Any workaround (virtualization, multiple instances, containers) is dependent on that shared resource and likely not going to improve your overall system performance.
If your servers had SSDs instead of "spinning rust", the conversation would be completely different.

I hope this helps to shed some light onto the question you raised.

Explorer

Wow, that's a great answer. We are going to ingest about 1 TB/day spread across all indexers, so about 200 GB/day per indexer.
We are going to retain data for as long as we can. Our ingest volume may grow rapidly, so we will need to change the retention policy.
We expect a lot of concurrent searches, so it is good to know that the CPUs and RAM are not wasted.

We are going to run one Splunk instance on each machine, but the dilemma now is how to mount the disks.
You said that I/O is our limit, so we need to do everything we can to make it as high as possible.
We don't want to put all the disks into a single LVM volume, because then we may hit an OS I/O limit while barely approaching the limit of the disks. However, it is hard for us to maintain several mount points and store specific indexes on each one to distribute the data across the disks: our index sizes are very dynamic and we have several hundred indexes, so managing this would be a pain.
Finally, we thought about having 2 mount points: one for hot/warm buckets and the other for cold buckets. But we are not 100% sure how to divide the disks between the mounts. We read in indexes.conf.spec that an open file handle is kept for each warm bucket, unlike cold buckets, so warm buckets are faster to search; yet all over the internet people allocate much more space to cold buckets than to hot and warm.

It would really help us to know exactly how cold and warm buckets differ in all respects, so that we know how to mount the disks, or to hear whether someone has a better solution for spreading the data across the disks.

Splunk Employee

The distinction between HOT/WARM and COLD exists for only one purpose: to allow you to pick the fastest possible disk (and hence the most expensive) for the data you are likely to search most, which is recent data, and pick cheaper mass storage for long-term retention (often with relaxed performance requirements). Based on our experience, in most environments more than 90% of searches run over data from the last 24 hours, so ensuring that these searches complete as fast as possible is key.

The HOT/WARM path sees heavy reads and writes; COLD is write once, read many. The only difference between HOT and WARM is that HOT buckets are actively written to, whereas WARM buckets are read-only once they have been created. So: inbound data goes into HOT buckets. HOT buckets roll to WARM after a configurable time has elapsed or a configurable number of buckets has been created. WARM buckets roll to COLD after a configurable number or size has been reached. COLD buckets roll to FROZEN after a configurable time has elapsed or size has been reached. See here for more details on that.
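
If it helps to see the lifecycle laid out, here is a tiny illustrative model of it. This is only a sketch of the stages as described above, not Splunk code. The names `Stage`, `next_stage`, and `is_writable` are invented for the example, and the configurable triggers (time, count, size) are left out.

```python
from enum import Enum

class Stage(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"
    FROZEN = "frozen"

# Order in which a bucket ages, as described above.
LIFECYCLE = [Stage.HOT, Stage.WARM, Stage.COLD, Stage.FROZEN]

def next_stage(stage):
    """Return the stage a bucket rolls to next, or None once it is frozen."""
    i = LIFECYCLE.index(stage)
    return LIFECYCLE[i + 1] if i + 1 < len(LIFECYCLE) else None

def is_writable(stage):
    """Only hot buckets are actively written to; later stages are read-only."""
    return stage is Stage.HOT

# Walk one bucket through its whole life.
path = []
stage = Stage.HOT
while stage is not None:
    path.append(stage.value)
    stage = next_stage(stage)
print(" -> ".join(path))  # prints "hot -> warm -> cold -> frozen"
```

The point for disk layout is that only the first stage takes writes. Everything after it is read-only, so the hot/warm versus cold split is about how often the data gets read, not about a different on-disk format.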

If you only have one kind of disk, it doesn't make a difference. I would have one volume for the OS and application, and another volume for your index data storage. You can create Splunk Volumes to manage space as a whole for multiple indexes that have the same retention settings, if you want to. There is no real reason why you can't use multiple disk volumes, but I don't really see the benefit of it either. What if you run out of space on volume1 while volume2 happens to have plenty of space available?

While I understand that you may not have that option, your biggest benefit would come from having a couple of SSDs in each server that you can use for HOT/WARM, saving your spinning disks for COLD storage. If you just had 1TB of SSD, you'd be able to keep almost a full day's worth of logs in HOT/WARM, removing all I/O contention for >90% of your workload.

HTH