
Syslog Server vs Heavy Forwarder

archme
Explorer

Hi,

I wanted to get some opinions on these two scenarios:

Scenario 1:

A collection of UFs + network devices sending via syslog -> consolidated at one HF -> IDX

Scenario 2:

A collection of UFs sends directly to the IDX.
Network devices send via syslog -> consolidated at one syslog server with a UF installed -> IDX

In Scenario 1, if there is a connectivity issue between the HF and IDX, will there be data loss? I understand we can enable persistent queues as well, which will store the logs temporarily in memory. Does this mean the HF will require a lot of memory to cater for data loss in case of a network failure between the HF and IDX?

In Scenario 2, there is storage space allocated on the syslog server which can cater for storing the logs temporarily in the event of an outage. However, in this scenario, how will events from Checkpoint/DB Connect be catered for? They still require a HF, so will they fall under Scenario 1? Multiple firewall rules will need to be allowed from all the UF-installed servers to the IDX. And the consolidation of logs from all the UFs will not happen in this scenario, right (in the case of collecting from remote sites)?

Question outside of scenario 1 and scenario 2:

Can a HF be installed on the syslog server and be used to collect the logs stored by the syslog server and push them to the IDX? What happens if there is a network failure? Will the HF have a marker that says where it stopped collecting the logs, and will it resume accordingly when connectivity comes back up? Will there be data loss?

Thanks in advance


xpac
SplunkTrust

The recommended best practice for your case, and pretty much all syslog setups, is to send the syslog traffic to a central server, where a syslog daemon (preferably syslog-ng, or rsyslog) collects all data and writes it to disk, split by hostname, date, etc.
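For illustration, a minimal syslog-ng sketch of that layout could look like the one below; the port, file path and directory layout are assumptions, not something from this thread, so adjust them to your environment:

# /etc/syslog-ng/conf.d/splunk-collect.conf -- minimal sketch, port and paths are assumptions
source s_network {
    udp(port(514));
    tcp(port(514));
};

destination d_per_host {
    # one directory per sending host, one file per day
    file("/var/log/remote-syslog/${HOST}/${YEAR}-${MONTH}-${DAY}.log" create_dirs(yes));
};

log {
    source(s_network);
    destination(d_per_host);
};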

You can then have a UF (or HF, if necessary) monitor those files and forward them to an indexer (see the sketch after the list below).
This is best practice for a bunch of reasons:
- Less prone to data loss, because syslog daemons are restarted less frequently, and restart much faster than Splunk
- Less prone to data loss, because all data is immediately written to disk, where it can be picked up by Splunk
- Better performance, because syslog daemons are better at ingesting syslog data than Splunk TCP/UDP inputs
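On the Splunk side, the forwarder configuration for this can be quite small. A sketch, where the paths, index, sourcetype and indexer names are assumptions:

# inputs.conf on the UF -- monitor the files the syslog daemon writes (sketch, values are assumptions)
[monitor:///var/log/remote-syslog/*/*.log]
sourcetype = syslog
index = network
# take the host name from the per-host directory (/var/log/remote-syslog/<host>/...)
host_segment = 4
disabled = false

# outputs.conf on the UF -- load-balance the data across the indexers
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997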

In this case, a network failure between UF/HF and IDX won't cause any problems, because the data is still being written to disk - the UF/HF will just continue to forward it when the connection is available again. I'd prefer this over Splunk persistent queues because it also allows for much easier troubleshooting: the data is written to disk, so you can look at it. If something is wrong, you can check whether you received bad data, or whether something went wrong after that with your Splunk parsing.

You can find more details in the .conf 2017 recording/slides for "The Critical Syslog Tricks That No One Seems to Know About", or in the blog posts Using Syslog-ng with Splunk and Splunk Success with Syslog.

Hope that helps - if it does I'd be happy if you would upvote/accept this answer, so others could profit from it. 🙂

FrankVl
Ultra Champion

Scenario 1 is killing for your data distribution across your indexers. The HF will only be sending to 1 indexer (or 2, if you enable an extra parallel pipeline) at any given moment. This means indexing load is very bursty on the indexers, and it also reduces search performance if data is not distributed nicely.
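To make that concrete: a forwarder auto-load-balances across the indexers listed in outputs.conf, but each pipeline only talks to one of them at any given moment. A hedged sketch, where the indexer names and values are assumptions:

# outputs.conf on the HF -- sketch, indexer names and values are assumptions
# the forwarder switches between these targets, one at a time per pipeline
[tcpout:idx_cluster]
server = idx1.example.com:9997, idx2.example.com:9997, idx3.example.com:9997
autoLBFrequency = 30

# server.conf on the HF -- the "extra parallel pipeline" mentioned above
[general]
parallelIngestionPipelines = 2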

In general, for syslog sources it is indeed recommended to use a syslog server to receive the logs, write them to file, and then have a forwarder (UF or HF, depending on your needs in terms of processing) read them and send them to the IDX. The forwarder will indeed keep track of which part of each file it has already sent, and in that way the files created by the syslog server provide a cache to cover for any downtime of the forwarding.

Note: if you have a lot of data coming through syslog, you may want to consider setting up multiple of those syslog servers behind a load balancer (or some other way of distributing the various syslog sources over them). This improves resilience, but also again improves data distribution. You want to create as few choke points as possible between sources and indexers.

One comment on persistent queues: those are on disk, not in memory. The normal queues are in memory. Note: in Scenario 1, adding persistent queues on that one HF helps if your IDX cluster is not reachable, but doesn't help at all with a breakdown (or temporary downtime due to maintenance) of the HF itself. That one HF is a single point of failure, which you would probably want to avoid.
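For reference, persistent queues are enabled per input in inputs.conf and overflow to disk once the in-memory queue fills up. A sketch, where the port and queue sizes are assumptions:

# inputs.conf on the HF -- sketch, port and queue sizes are assumptions
# persistent (on-disk) queue for a direct syslog input
[udp://514]
sourcetype = syslog
queueSize = 10MB
persistentQueueSize = 5GB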

So I would say: aim for Scenario 2. If you need a HF for certain purposes like DB Connect, you can just set up a HF on its own to handle those data feeds, completely independently from the UFs and the system(s) processing the syslog data.

jlvix1
Communicator

Scenario 1 is more practical and less troublesome; it's worth doing if you end up with your logs looking right at the indexer, with the correct source host.

Keep in mind that there is a distinct difference between UDP and TCP when using syslog: you are more likely to lose data with UDP, as it's black-holed if the endpoint isn't available. It's always better to use 2x HFs with load balancing.

You could stick to using TCP throughout - you can get a forwarder to index a copy of events locally and then forward on; that's always an option, but you need an Enterprise license. I personally would keep my distance from that idea.
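For completeness, the "index a local copy and forward on" behaviour referred to here is, as far as I know, the indexAndForward setting in outputs.conf on a full Splunk Enterprise instance (hence the licence requirement). A sketch, where the indexer names are assumptions:

# outputs.conf on a heavy forwarder -- sketch, indexer names are assumptions
# keep a local indexed copy and still forward to the indexers
[indexAndForward]
index = true

[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997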

With syslog you can't really avoid losing events if you need to make changes to the handlers!

I specialise in the syslog area in general, and I've dealt with massive-scale systems that were a mixture of Scenarios 1 and 2 and others. We actually ended up consolidating everything to load balancers, which then forwarded to 4 x RHEL rsyslog instances (like Scenario 2), dropped to disk for safety and compliance, and then relayed instantly to HFs and an external SOC - mostly to preserve the source host header. There will always be the problem of determining what is missing and then manually re-loading the events that were missed, but we decided this would only be the case when a large window of data was missing, or to supplement an incident investigation.
