Getting Data In

Indexer and Search Head Hardware Diminishing Returns

Kieffer87
Communicator

I've been looking through the recommended hardware specs and talking with Splunk, and I haven't gotten a straight answer. I'm hoping some of you with experience can shed some light.

We are going to start with a 500GB/day ingestion rate and expect it to grow over the next 12-24 months. Working with Splunk, we have landed on 3 physical search heads and 4 indexers. It has been difficult to get a straight answer on whether more hardware is beneficial, outside of the "more is always better" answer.

Our current hardware plan is to go physical for these items:
(3) Search Heads, clustered (2x 14-core @ 2.6GHz, 64GB RAM)
(4) Indexers, clustered (2x 14-core @ 2.6GHz, 128GB RAM)
Hot buckets: 5TB SSD per indexer
Warm buckets: 15TB RAID6 (4-5K IOPS) per indexer

So my question is: will we really see a benefit from, for example, 128GB of RAM in our indexers vs. the reference spec of 12GB, or from 28-core machines vs. the 12-core (Enterprise) or 16-core (Enterprise Security) reference specs? If Splunk can't or won't actually use the added resources, I may dial the specs back a bit.

Also, I'm debating whether our indexers need 10G vs. 1G NIC connectivity. Again, Splunk's answer was that 10G would be nice and a 10G LAG would be even better. Does anyone have real-world examples of where it makes sense to bump up to 10G?

1 Solution

lukejadamec
Super Champion

Here are some real world comments:
1) Manufacturers make high-core-count, low-clock-speed processors cheap, and they are slow. Reduce the core count, get 3.5GHz processors with high bus speeds, and match the bus speed to the RAM. You'll end up with much better performance even with fewer cores and less RAM - every CPU operation will be roughly 30% faster (3.5GHz vs. 2.6GHz).
2) The separation of Hot and Warm buckets based on SSD or not might not make sense. A Hot bucket is an active indexing bucket that is searchable and is rolled to Warm (for example, once a day or on restart); a Warm bucket is a non-indexing, recent, searchable bucket that was rolled from Hot.
a. Hot bucket needs fast Read and Write, so SSD makes sense on the surface.
b. Warm bucket needs fast Read speed, so HDD makes sense on the surface.
Real World:
a. With 5TB of SSD (~20 days in your environment), it makes more sense to set Hot and Warm based on primary indexes, i.e. the indexes that will be the targets of the most common searches/alerts/reports in your environment, and use the HDDs for Cold buckets for those indexes (see the indexes.conf sketch after this list).
b. Use the HDDs for Hot, Warm, and Cold buckets for nice to have data indexes.
3) RAID 6 vs RAID 10.

a. RAID 6 will give you more space, but you take a big hit on Write performance and lower Read performance: Write = ((Disks - 2) * Speed) / 6 and Read = (Disks - 2) * Speed.
b. RAID 10 is as close to SSD as you can get with HDDs, but it takes more disks: Write = Disks * Speed and Read = 2 * Disks * Speed, where Disks is the number of mirrored pairs. For example, with 12 disks at 150MB/s each, RAID 6 gives roughly 1,500MB/s read but only ~250MB/s write, while RAID 10 (6 mirrored pairs) gives ~900MB/s write and ~1,800MB/s read.
4) Network speeds need to consider the whole network. You can only get data into a server as fast as the network infrastructure can deliver it, and you want the data into the indexers as fast as possible. Look at two things: 1) NIC performance specs, and 2) network infrastructure throughput potential.
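For point 2's layout, a minimal indexes.conf sketch might look like the following; the index names and mount points are hypothetical and would need to match your own volumes:

    # Hypothetical primary index: hot/warm buckets on SSD, cold on HDD
    [web_proxy]
    homePath   = /mnt/ssd/splunk/web_proxy/db         # hot + warm buckets
    coldPath   = /mnt/hdd/splunk/web_proxy/colddb     # cold buckets
    thawedPath = /mnt/hdd/splunk/web_proxy/thaweddb   # restored archives

    # Hypothetical "nice to have" index: everything on HDD
    [app_debug]
    homePath   = /mnt/hdd/splunk/app_debug/db
    coldPath   = /mnt/hdd/splunk/app_debug/colddb
    thawedPath = /mnt/hdd/splunk/app_debug/thaweddb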

googleLogZilla
Explorer

@lukejadamec, here's proof:
https://youtu.be/dX_wwCpVURQ
Also, the average log size is around 240 bytes (excluding Java).

googleLogZilla
Explorer

At 500GB per day, that works out to a little under 25k events per second.
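For reference, using the ~240-byte average event size mentioned above, the arithmetic works out roughly like this (both figures are approximations):

    500GB/day ÷ 86,400 s/day ≈ 5.8 MB/s
    5.8 MB/s ÷ 240 bytes/event ≈ 24,000 events/s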
So why does it take 7 (quite expensive) servers with all of these resources to do what could be done on a $2k PC?
I'll give you a hint: You're doing it wrong 😉

Here's what I used (an older PC):
http://www.newegg.com/Product/Product.aspx?Item=N82E16856101117

It has 32GB RAM and runs Ubuntu 16, with a Samsung EVO NVMe used as a cache drive via lvmcache.
I'm able to ingest at around 40k eps, and searches take a few seconds.
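For anyone curious, a minimal lvmcache setup along those lines might look like this; the device names, volume group, and sizes are assumptions, not a record of the actual build:

    # Add the NVMe drive to the (hypothetical) existing volume group
    pvcreate /dev/nvme0n1
    vgextend vg_data /dev/nvme0n1

    # Carve a cache pool out of the NVMe and attach it to the data volume
    lvcreate --type cache-pool -L 400G -n lv_cache vg_data /dev/nvme0n1
    lvconvert --type cache --cachepool vg_data/lv_cache vg_data/lv_splunk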

Side note:
If you do insist on spending a whole bunch of money on something you don't need, be sure your disks are set up correctly in RAID (10 is best, but 6 is fine). Be sure you have the correct stripe size both in the RAID BIOS/config and when you partition the disks in Linux, along with the right mount options in /etc/fstab - and be VERY sure you partition on the right sector boundaries (4KB by default); not doing so will drastically reduce performance.
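A rough sketch of those checks; the device name, stripe geometry (256KB stripe unit across 10 data disks), and mount point are all assumptions:

    # Verify the data partition starts on an aligned boundary
    parted /dev/sdb align-check optimal 1

    # Build the filesystem to match the RAID geometry (stripe unit/width)
    mkfs.xfs -d su=256k,sw=10 /dev/sdb1

    # /etc/fstab - noatime avoids needless metadata writes on the data volume
    /dev/sdb1  /opt/splunk/var  xfs  defaults,noatime  0 0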

Be sure your OS disks are not the same as your data disks. You will thank me later 😉
I spent a few hours and went from around 300MB/s read / 76MB/s write to 1.3GB/s read / 800MB/s write.
Happy to share my notes if you want to PM me.

advt
Explorer

I'm definitely interested in finding out how you did that. All I hear is that I spend too much money on hardware to support this software. Can you provide details on how to do this less expensively?

lukejadamec
Super Champion

@googlelogzilla it looks like your math is wrong. 25K events/second works out to only about 2GB/day at 1 byte/event. We all know events are often much larger than 1 byte, and the question referenced 500GB of indexed/searchable events per day.
I respect anyone who has done this work and succeeded, and I'm willing to learn from your experience.
I'm expecting that this environment will accept input from X number of systems and will be tasked by Y number of users with search requests for A number of reports, B number of ad hoc searches, C number of summary-indexed searches/reports, and D number of alerts (based on real-time searches).
In my real world you need superior processing power and I/O for both indexers and search heads to achieve this.
As a side note, the number of search threads is dependent on the number of cores, so be sure to balance the search load with core speed.
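To make that concrete: Splunk's cap on concurrent historical searches is derived from core count, roughly base_max_searches + max_searches_per_cpu x cores. A sketch with the default values (check the limits.conf docs for your version before relying on the exact numbers):

    # limits.conf (defaults shown, for illustration only)
    [search]
    base_max_searches    = 6
    max_searches_per_cpu = 1
    # e.g. a 28-core box: 6 + (1 x 28) = 34 concurrent historical searches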

homerskid
Engager

Hi - someone told me about this before (how to do all of this on a PC-class system), and I had the same reaction as the OP. So that makes two different people saying they can log at these high rates on a PC. Can you please explain how you did this?

Kieffer87
Communicator

I downvoted this post because it goes against all the recommendations I have read on splunk.com and what our Splunk rep has shared with us.

Kieffer87
Communicator

Not sure if you posted on the wrong question or if you're running this out of your house, but I think your recommendation is pretty far off. I'm not about to recommend buying a $2,000 computer as the start of what will become an enterprise logging solution for a Fortune 100 company. The clusters listed above were designed with help from our Splunk rep and will be load-balanced between our two primary data centers for redundancy and HA. They will also allow us to scale out easily as we add more log sources.

As for the drives, the OS will run on RAID 1 15K SAS drives. Storage will be on an enterprise SAN, both SSD and 15K SAS.

googleLogZilla
Explorer

I've consulted at F100s for 15 years (9 of which were at Cisco Systems), so I am intimately familiar with these scenarios.
On redundancy and HA:
OK, so use 4 x $2k computers - joking a bit there, but sure, you can do it just fine at a fraction of the cost.
Just because you have not seen this done doesn't mean it can't be done.
The world wasn't always round, ya know?

richgalloway
SplunkTrust

Ingestion rate is only part of the puzzle. Your indexers will also be performing searches. The more searches (including accelerated data models, scheduled reports, and alerts) you run, the more CPU you will need on the indexers. Depending on the nature of those searches, you may find you need more indexers to better distribute the work and produce results quickly.

The story is similar for connectivity. A 1G NIC may be enough for ingestion most of the time (have you considered burst rates?), but the indexers will also be transferring bundles back and forth to the search heads so be sure to allow for that. My customer, with a lower ingestion rate, recently decided to go to 10G NICs. I suggest you do the same.
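A rough back-of-envelope on bandwidth, assuming ingest is spread evenly across the 4 indexers, bursts of about 10x average, and a replication factor of 3 (all assumptions, not figures from this thread):

    500GB/day × 8 bits/byte ÷ 86,400 s/day ≈ 46 Mbit/s average ingest
    46 Mbit/s ÷ 4 indexers ≈ 12 Mbit/s per indexer
    12 Mbit/s × 10 burst × 3 replication ≈ 350 Mbit/s per indexer

That still fits within 1G on paper, but once you add knowledge bundle pushes, bucket replication catch-up after an indexer outage, and search traffic, the headroom of 10G starts to look cheap.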

---
If this reply helps you, Karma would be appreciated.

Kieffer87
Communicator

Thanks for the NIC input, I will pass that along to my server team.

Kieffer87
Communicator

Thanks for all the input.

I should have mentioned we will also have 135TB in Hadoop or another cheaper storage medium for cold storage. We need to keep the logs searchable for 12 months to meet our audit team's requirements.

  1. Good advice; I will see what our server team comes back with for prices. The E5-2690 v4 (2.6GHz/14-core) was actually a bit more expensive than the E5-2667 v4 (3.2GHz/8-core), so I could save some money there and gain some per-core performance. We were also planning on DDR4-2400 memory, which is the max these processors support.

  2. Setting the most searched indexes on SSD makes perfect sense, thanks for that input.

  3. Unfortunately I can't do much about the RAID6, as that's what our enterprise SAN is built on. We will be spread across 60-70 spindles with IOPS in the 4-5K range, so I'm hoping that offsets RAID6's slower read and write speeds. Fingers crossed.
