Syslog Servers in RedHat Cluster with Splunk Universal Forwarder/s

pranitprakash
Explorer

We are planning to implement Red Hat Cluster (RHCS) for our syslog servers. They will run in an active-passive configuration controlled through heartbeat and will have Universal Forwarders installed.
There are two options that we need to select from:

  1. Shared SAN storage between the two servers: if the primary goes down, the syslog daemon is started on the secondary automatically, and we can auto-start the Splunk service on the secondary syslog server when the primary is unavailable.

  2. Separate storage on each server, with the Splunk service running on both: if the primary goes down, the secondary syslog server will collect the logs and transmit the data to the indexers through heavy forwarders.

Which of these is the better approach?
I assume many of you have implemented syslog with failover to send data to Splunk. Please advise.

0 Karma

s2_splunk
Splunk Employee

May I ask why you are going the active-passive route with the added complexity of RHCS instead of just using two standalone syslog servers behind a load balancer? That is the best-practice architecture for syslog data collection and gives you an active-active solution with twice the capacity and no increased availability risk.

Also, why do you have Heavy Forwarders in the architecture instead of sending directly from the universal forwarders installed on the syslog servers?

If you are set on using RHCS, go with option 2 since it removes another single point of failure (Shared SAN) with unpredictable storage performance. And if you are doing that, read my first question again... 😉

cpetterborg
SplunkTrust

Active-active is definitely superior because of the data loss caused by the lag time of failing over from passive to active. UFs are three times faster at processing than HFs, so I agree with ssievert on that point, too.

Unless you are tied to using RHCS and HFs for architectural reasons, really think about changing your design. There are faster and more reliable solutions you could go with.


pranitprakash
Explorer

Thank you both. My responses to your two questions:
1. Why RHCS? We are using RHCS rather than an external LB because of the extra cost of an external load balancer, so we are focusing on RHCS for active-active/active-passive. Active-passive was the suggested approach because we can run the service on only one server (say, the primary) and switch to the secondary only when the primary goes down.
Is there another possible approach without an external LB and without RHCS?
2. Why HFs? We have HFs as the next layer, working as a gateway. On the syslog servers we intend to use UFs, and these will pass the logs to the next layer of HFs before they reach the indexers.

Alright, I will focus on the active-active configuration using RHCS. Please let me know what could be another approach without RHCS and external LB.


s2_splunk
Splunk Employee

An alternative solution could be to use DNS round-robin list with two entries and a very short TTL, I guess. Do you not have any external load balancers in your environment already that could be configured to host a VIP address for your syslog server pair? You don't need a dedicated one...
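For illustration, a DNS round-robin setup of that kind might look like this hypothetical zone-file fragment (names, addresses, and the 30-second TTL are placeholders, not details from this thread):

```
; two A records for the same name with a short TTL;
; resolvers rotate between the two syslog servers
syslog   30   IN   A   10.0.1.11
syslog   30   IN   A   10.0.1.12
```

Keep in mind that many syslog senders resolve the name once at startup, so round-robin gives coarse distribution across the pair rather than true per-connection load balancing.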

Make sure your 'gateway' forwarder does not present a choke-point for events flowing to your indexers and affect your event distribution across your indexers (affects search performance and potentially data retention). You should have twice as many intermediary forwarders (or forwarder pipelines) as you have indexers.
If you don't have any network constraints that prevent direct UF-Indexer connections, and you don't have any filtering/routing/xxx requirements that mandate a gateway/intermediary forwarder, consider removing it from your architecture. You'll lead a happier life... 😉
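If the intermediary tier is removed, the UFs on the syslog servers simply auto-load-balance straight across the indexing tier. A minimal outputs.conf sketch (indexer names and ports are placeholder assumptions):

```
# outputs.conf on each syslog server's universal forwarder
[tcpout]
defaultGroup = all_indexers

[tcpout:all_indexers]
# the UF rotates its connection across this list automatically
server = idx1.example.com:9997, idx2.example.com:9997, idx3.example.com:9997
```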


cpetterborg
SplunkTrust

I'm not sure I understand the sentence:

You should have twice as many intermediary forwarders (or forwarder pipelines) as you have indexers.

Are you saying that if you have 10 indexers, then you must have > 20 UF's or HF's sending data to the indexers for the syslog data? I don't think that is what you are saying, but that is what I'm reading. Sorry to ask, but I'm just confused by the sentence.


s2_splunk
Splunk Employee

If I assume that those two syslog servers are just two of a much larger number of forwarders that send data to indexers via an intermediary forwarder, then yes, that is exactly what I am saying. Make sure you have at least 2x intermediary forwarding pipelines. And note that those don't have to be individual servers (see here).
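For reference, the "forwarder pipelines" mentioned above are enabled in server.conf on the intermediary host; a sketch, assuming the host has spare CPU for a second pipeline:

```
# server.conf on the intermediary forwarder
[general]
# run two independent ingestion/forwarding pipelines on this host;
# each pipeline load-balances across the indexers on its own,
# so one server counts as two "pipelines" for sizing purposes
parallelIngestionPipelines = 2
```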

A given forwarder (intermediary or not), only talks to a single indexer at any given point in time (unless you have configured multiple forwarding pipelines). If you funnel - for example - 300 forwarders through a single intermediary forwarder, all events will go to a single indexer for however long it takes the intermediary to switch indexers. This will negatively affect event distribution across your indexing tier, which will negatively affect your search performance, especially for searches across recent time windows. This is because not all your indexers (=search peers) will participate relatively equally and in parallel in satisfying a search request.

The same will be true for the syslog data. The syslog servers are already a concentration point for events, given that a large number of network devices and other systems feed their events to them. If you don't ensure, via proper architecture and/or proper configuration, that the forwarding data stream is (ideally) evenly spread across your available indexers, you will likely experience issues: less-than-ideal search performance and/or premature data ageing due to some indexers having to index and store more data than others.
Small-ish differences in event counts are normal, but "sticky forwarders" that don't switch their indexer connections regularly, or not enough intermediary forwarders will cause issues that cannot be corrected easily once they manifest themselves.

I will go to great lengths with my customers to try and dissuade them from intermediary forwarding tiers for those reasons. There are use cases and requirements where you have no choice, but more often than not they are not really needed; they can introduce a number of issues, create another point of failure, and result in your Splunk configuration being distributed between indexers and intermediaries when you could have everything in one place.

So, if you need them: have enough of them that you don't create a funnel forcing an event stream from potentially thousands of log source systems through a very small number of intermediary forwarders, and ensure they are configured properly (forceTimebasedAutoLB=true, autoLBFrequency=).
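Those two settings live in the forwarder's outputs.conf; a sketch with illustrative values (the server list and the 30-second frequency are assumptions, not recommendations from this thread):

```
# outputs.conf on the forwarder
[tcpout:indexers]
server = idx1.example.com:9997, idx2.example.com:9997
# switch indexers on a timer, even if the current
# connection is still healthy and sending data
forceTimebasedAutoLB = true
# seconds between forced switches
autoLBFrequency = 30
```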

I hope that makes more sense now.


cpetterborg
SplunkTrust

Unless the entire Splunk forwarder directory structure (the Splunk application) is shared on the shared SAN area on your syslog servers, you will have fishbucket problems with option #1. Even with shared forwarder space you could have problems. I would opt for #2 if those are your only two choices.

There are other (better) solutions to the problem, but if these are your only two because there are design criteria you must adhere to, I suggest sticking with #2. It is possible to implement this in an HA design that doesn't even need disk space for the collection for a forwarder to use. I currently have a non-HA implementation that can handle over 1TB of syslog data on a single syslog server and doesn't use any disk (which is a particularly bad bottleneck when using a SAN). It is currently handling over 800GB on an 8-CPU VM, and the load average is 0.00 (yes, that is not a typo) with the CPU usage at about 50%. We don't need it HA, so I didn't opt for a second machine.
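One way to read the "no disk" collection described here is syslog delivered straight into a network input on Splunk, so events never touch the syslog server's filesystem. A sketch of such an inputs.conf stanza (port and sourcetype are assumptions, not details from the post):

```
# inputs.conf on the receiving Splunk instance:
# a raw TCP input that takes syslog directly off the wire
[tcp://:1514]
sourcetype = syslog
```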


pranitprakash
Explorer

Interesting. How did you design a failover without HA in your non-production set-up? Can you elaborate more? Thanks!


cpetterborg
SplunkTrust

As I said, we didn't need HA, so in our case there is no failover. If there is a loss, we are okay with it, though we hope not to lose any data. So far we have only had one short outage. And we don't have a non-prod setup. So I'm not sure how to answer.

But, that being said, you can use a load balancer in front of two syslog servers that are both configured identically to forward the data into Splunk.

Our Splunk SE is planning a Splunk Blog post about the architecture because it is so efficient and simplified. About the highest throughput you can get with a single UF is around 800GB/day. I'm pretty sure you could push at least double that with this architecture, because there is no disk involved between the sender and the indexer. The one downside is that the host assignment happens at index time through transforms.conf (you could push it to search time, though the assignment seems to add hardly any load on the indexers). If you are interested in the architecture, I can post the link to the blog post when it is published.
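An index-time host assignment of the kind described could look roughly like this (the regex, sourcetype, and stanza names are illustrative assumptions; the actual blog-post configuration isn't shown in this thread):

```
# props.conf
[syslog]
TRANSFORMS-set_host = syslog_host_override

# transforms.conf - take the hostname from the RFC 3164 syslog
# header ("Mon dd hh:mm:ss host ...") and write it into the
# event's host metadata at index time
[syslog_host_override]
REGEX = ^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+(\S+)
DEST_KEY = MetaData:Host
FORMAT = host::$1
```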


cpetterborg
SplunkTrust

We are currently using a load balancer in the architecture we have set up, but it isn't an F5 or anything expensive. It's a VM with only a couple of CPUs, and it is using less than 100% of a single CPU. So if you are worried about the expense of an LB, there really is hardly any. Yes, it could be the single point of failure if you go that route, but there are ways to reduce that risk.


o_calmels
Communicator

Hi, have you thought about sending events directly from your two syslog servers to the Splunk indexers or to a single forwarder (via the syslog output configuration, e.g. Red Hat's rsyslog config)?
This architecture doesn't need any Splunk install on your syslog cluster, so you can manage your storage however you want.
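In rsyslog terms, that direct forwarding can be a single line (the target host and port are placeholders; the receiving Splunk instance would need a matching TCP input):

```
# /etc/rsyslog.d/splunk.conf
# '@@' means TCP (a single '@' would be UDP);
# forward all facilities/severities to the Splunk receiver
*.* @@splunk-indexer.example.com:1514
```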
