Deployment Architecture

Discrepancies between DMC server view and actual server view

Abass42
Communicator

I have a few questions about how the DMC gathers its server specification information and how to extract it.

I am trying to add resources to our slower indexers, following the official resource docs here.

To begin, I am viewing my infrastructure through the DMC view:

Monitoring Console -> Settings -> General Setup

 

And from here, I get a view that looks like:

Abass42_0-1747751154751.png

In this view, my server is reported as having 4 cores and 15884 MB of memory. These are Azure VMs, so they use vCPUs; this one specifically is an Azure indexer. We have latency issues with these, and I believe it's because they are under-resourced.

 

Looking at my server specifically, I get the following CPU specs:

Abass42_1-1747751436488.png

 

From looking at the other hosts in both sources, I have concluded that the DMC's core count = cores per socket x sockets (which makes sense).

From some references I was looking at, it seemed like I needed to be looking at the CPU(s) value from the lscpu command. For this example, the server has 8 vCPUs, and it's recommended we have at least 12.
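
For comparison, I also looked at what Splunk itself reports over the management port. A search like the one below, run locally on the box, seems to show both counts side by side (I'm going from memory on the exact field names, so treat them as assumptions):

    | rest /services/server/info
    | table serverName numberOfCores numberOfVirtualCores physicalMemoryMB

On the hosts I checked, numberOfCores appears to line up with sockets x cores-per-socket from lscpu, while numberOfVirtualCores appears to match the lscpu CPU(s) value, i.e. including hyper-threads.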

When upgrading, do I need to focus on the number of cores, or do I just need to specify how many vCPUs I need?

I also wanted to know how I could extract this view from the DMC. I want to table all of the resource metrics so I can get a full picture of what I have in my Splunk environment.
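
Something along these lines, run from the Monitoring Console's search bar, is roughly what I'm after (assuming the instances show up as search peers of the MC, and again guessing at the exact field names):

    | rest /services/server/info splunk_server=*
    | table splunk_server serverName numberOfCores numberOfVirtualCores physicalMemoryMB os_name cpu_arch

If the DMC pulls its values from a different endpoint, I'd love to know which one.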

Thank you for any guidance or insight. 


PickleRick
SplunkTrust

No. You have 8 vCPUs, whereas the minimum indexer specs call for 24 vCPUs.

Actually, it's a bit more complicated than that. Since most if not all modern CPUs include hyper-threading, the OS shows more "CPUs" than the die actually has cores. However, it's not easy to calculate the performance of such a setup, since hyper-threading works pretty well with multithreaded applications but doesn't give much of a boost to many single-threaded ones.

Anyway, you most probably have a virtualized 4-core, 2-threads-per-core CPU, which is really low-end for a production indexer. Yes, you can run a Splunk lab at home on 4 cores, but if you want to put any reasonable load on the box, you'll have a lot of problems. As a rule of thumb, Splunk uses 4-6 vCPUs just for a single indexing pipeline, so there's not much juice left in your hardware for searching. This box is really undersized.

Abass42
Communicator

Yes, it is. I wasn't around when these were spun up, but now that it's up to me to fix it, I want to make sure we won't run into these issues again for a few years. If we already have 10 physical indexers that handle most of the data, would I need the 96 vCPUs for the other 6 in Azure? I have to consider costs as well.

 

Thanks 


isoutamo
SplunkTrust

You said that the indexers are slow, but how did you reach that conclusion? Does it mean there is a lack of CPU, memory, or I/O resources? Before you start to increase the size of the servers, you must understand exactly what your situation is! For example (a couple of starter searches are sketched after this list):

  • How much are you indexing daily?
  • How many searches do you run?
  • How many indexers do you have?
  • What kind of topology do you have?
  • Do you have SmartStore in use?
  • Which kind of nodes do you have?
  • What node types do you have?
  • What storage do you have?
  • What are your metrics for CPU, memory, and I/O?
  • How many indexes?
  • etc.
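
For the first and last points, here are a couple of quick starting points, sketched from memory, so adjust the field names as needed:

    index=_internal source=*license_usage.log* type=Usage earliest=-30d
    | eval GB=b/1024/1024/1024
    | timechart span=1d sum(GB) AS daily_GB

    index=_introspection sourcetype=splunk_resource_usage component=Hostwide earliest=-7d
    | eval cpu_pct='data.cpu_system_pct'+'data.cpu_user_pct'
    | stats avg(cpu_pct) AS avg_cpu_pct max(data.mem_used) AS max_mem_used_MB by host

The first shows your daily ingest over the last month; the second gives a rough per-host CPU and memory picture from the introspection data the MC already collects.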

Abass42
Communicator

Hey, 

Thanks for your response. With this Azure cluster, we receive latency alerts all the time. Comparing the installed and available resources to the recommendations, we are under-resourced. We have a clustered environment: two clusters at a different location for both the search heads and the indexers. We have 10 physical indexers, each with about 60 TB of storage, 48 cores, and 96 logical CPUs. Compare those to the Azure indexers, which have 4 cores and 8 logical CPUs.

 

Our Azure cluster has been giving us latency warnings for a few years now, even after a few upgrades. Now that I am more comfortable with our environment, I want to finally upgrade the CPU and memory to the recommended values.

 

We have 6 Azure indexers, all of which have latency issues at some point or another. We have about 100 indexes; our top 3 sources ingest 600 GB daily, and we average about 1.7 TB a day.

 

To sum up, these are under-resourced, and they need more CPU.

 

Thanks


kearaspoor
SplunkTrust

Hi! @isoutamo looped me in because he knows I'm currently in an Azure environment that's doing ~1 PB/day.

First, are you using SmartStore to offload older events to blob storage? If so, at around 1 TB/day you're going to want to start thinking about splitting up your cluster, because Azure throttles blob upload/download. That WILL cause latency problems. There's also a whole bunch of SmartStore tuning you'll need to consider to minimize cache thrashing. If you're not using SmartStore, then the math goes a completely different way.
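
If you're not sure whether SmartStore is in play, a quick check from the MC is something like the search below (I'm going from memory on the field name, but SmartStore-enabled indexes should show a remote path and non-SmartStore ones won't):

    | rest /services/data/indexes splunk_server=*
    | search remotePath=*
    | table splunk_server title remotePath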

Generally, what instance types are you using? We've evaluated the following and found them to be more than capable at our scale:
  • Dasv5-series
  • Dasv6-series
  • Lsv3-series
  • Ebsv5-series
  • Edsv5-series
  • Edsv6-series
If you ARE using SmartStore, keep in mind that there's no concept of hot/cold, just local disk/remote store, so some of the faster local NVMe may not scale up to what you need for your local cache. In that case, going for instance types that don't have local NVMe but can instead scale attached disk for your local cache is the way to go. That was our situation, which is why we chose instances that don't have local disk but allow lots of disks to be attached.

If you AREN'T using SmartStore, then you'll want to look at the other instance types and leverage the local NVMe disk for hot/warm and the attached disk as cold.

Beyond that, it's just a matter of picking the right size for your instance types to meet your SF/RF needs and your data ingest/search load. SmartStore/blob storage is really the piece that makes Azure unique. Let me know if you are using it and we can discuss how to go about splitting your storage account(s) and possibly splitting your cluster.

Abass42
Communicator

Hey, 

Thank you for assisting. 1 PB is incredible; I thought 1.7 TB was a lot. I am not too sure about the instance types; I am reaching out to find out. As far as SmartStore goes, I don't believe we are using anything of the sort. We just have retention policies to roll data to cold/frozen. I'll look over the docs regarding SmartStore.

 

In regard to minimum requirements, a lot of our administration servers (Deployment Server, some of the servers handling syslog data, DMZ heavy forwarders, Cluster Managers, etc.) have around 6-8 cores and roughly 6 CPUs.

The server pictured above is our Azure Cluster Manager. It manages a cluster, and it may also index data itself (I'm not sure), but it only has 8 CPUs and 4 cores. Should all servers at least meet the minimum requirements below, especially with our ingestion load? I would imagine so. (I've sketched a quick audit search after the list.) I can work with Support to answer any specific questions, as I already have an ODS case open to handle this Splunk version upgrade. The RHEL upgrade is being pushed because RHEL 7's support is expiring.

  • 12 physical CPU cores, or 24 vCPU at 2 GHz or greater speed per core.
  • 12 GB RAM.
  • A 1 Gb Ethernet NIC, optional second NIC for a management network.
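
To check this across the environment, I was planning to extend the earlier REST search into a quick pass/fail audit, something like the following (field names assumed, as before; 12 GB = 12288 MB):

    | rest /services/server/info splunk_server=*
    | eval meets_min=if(tonumber(numberOfVirtualCores)>=24 AND tonumber(physicalMemoryMB)>=12288, "yes", "no")
    | table splunk_server numberOfCores numberOfVirtualCores physicalMemoryMB meets_min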

 


PickleRick
SplunkTrust

For the "auxiliary" servers (although CM is very important for cluster operations) the sizing hugely depends on a scale. You can have a TB-sized environment which still serves only a few dozens of UFs from DS so you can make this DS really small (6CPU would suffice; I've seen such environments) but you could as well have several thousands of UFs pulling from DS. Anyway, with DS you can significantly lower the server's load by increasing the polling period at the cost of increased "latency" of changes to deployed apps.

The CM also grows with the size of your environment. A TB/day scale is still relatively moderate, so it shouldn't need 24 vCPUs for that.


isoutamo
SplunkTrust

Based on that information, your cluster sizing is not even near enough 😬 I'm quite sure that your environment needs something other than CPU, too.

In Azure there are some other limitations and recommendations that you must handle at your current volumes. Let's see if we can get some people who have worked more with bigger Azure Splunk installations.

Abass42
Communicator

I am working alongside the Unix team, as they have a better understanding of storage and resource requirements than I do. But this upgrade is long overdue.

Thanks for the assistance. I think I have enough for the requests I am filling out.
