Solved: Help with Heavy Forwarder data loss prevention sol...

SplunkExplorer · ‎09-22-2022

Hi Splunkers,

we have a customer with a Splunk Cloud environment.
Every tenant has one HF, managed by us, that sends data to the cloud platform, and we must solve the HA problem.
Because of a Splunk recommendation, we have not implemented HA in the "usual" form, so we cannot add one or more HFs and manage them with a Deployment Server to implement HA.

Our first solution is a scheduled snapshot that runs every day; if the HF server crashes, we restore the last working snapshot. This solution has a big problem: suppose a crash occurs in the early afternoon and the restore happens the following morning. This raises the following questions:

What happens to the data sent from the sources to the HF during the time the HF is down? Is it lost, or is it processed once the HF is back up and running?
If the data is recovered after the forwarder is restored, I suppose it is stored in the forwarder queue. What limits does this queue have? What is its size? Will it be able to ingest all the data, or will some be lost?
Suppose the queue is able to hold all the data; does the processing speed depend only on the hardware, or does the forwarder have its own limits?

Another problem: if this solution does not protect us from data loss, and considering that we cannot have multiple HFs, what could be a feasible solution for HA?

maciep · ‎09-22-2022

Been out of the game for a couple years, so I am sure that others here can provide better/updated information, but I'll still give this a quick go.

First, if you can literally only have a single HF, then I don't think you can have H/A. It just doesn't make any sense. You may be able to limit data loss, but I don't think you can ever call it highly available. And how much data loss you can avoid depends, I think, on what roles the HF is playing.

I don't recall the details, but I believe a UF has a queuing mechanism, so if its target isn't available, it can queue the data. I no longer know how much or for how long it can queue; I think it is configurable but limited. So if it's a pretty busy UF (like one on a domain controller) and/or the downtime is long, there will likely be data loss.
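To make the "busy sender plus long downtime" point concrete, here is a rough back-of-the-envelope sketch in Python. All figures (data rate, outage length, buffer size) are hypothetical and are not Splunk defaults; the point is only the arithmetic of backlog versus buffer size.

```python
def backlog_bytes(rate_bytes_per_sec: float, outage_seconds: float) -> float:
    """Amount of data produced while the target is unreachable."""
    return rate_bytes_per_sec * outage_seconds


def seconds_until_full(queue_bytes: float, rate_bytes_per_sec: float) -> float:
    """How long a buffer of the given size can absorb the stream."""
    return queue_bytes / rate_bytes_per_sec


# Hypothetical figures, for illustration only.
RATE = 50 * 1024            # 50 KiB/s produced by the sender
OUTAGE = 18 * 3600          # 18 hours: early-afternoon crash, next-morning restore
QUEUE = 500 * 1024 * 1024   # 500 MiB of buffer on the sender

needed = backlog_bytes(RATE, OUTAGE)
print(f"backlog during outage: {needed / 1024**3:.2f} GiB")
print(f"buffer lasts: {seconds_until_full(QUEUE, RATE) / 3600:.2f} h")
print("data loss expected" if needed > QUEUE else "buffer is large enough")
```

With numbers like these the buffer fills within a few hours, so anything produced after that point would be dropped unless the buffer is sized for the longest outage you expect.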

If the HF is a HEC endpoint, then I think it would be up to the code sending data to it to handle that logic. But again, it is unlikely that the sender could wait, say, 12 hours to send its data, so data loss is likely again.
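As a minimal sketch of what "the sending code handles that logic" could look like, here is a bounded retry buffer in Python. `BufferedSender` and `FakeEndpoint` are hypothetical names made up for this illustration; in real use the `send` callable would be whatever posts one event to the HEC endpoint, which is stubbed out here so the example runs without a network.

```python
from collections import deque
from typing import Callable, Deque, List


class BufferedSender:
    """Keep unsent events in a bounded in-memory buffer and retry them later."""

    def __init__(self, send: Callable[[str], bool], max_events: int) -> None:
        self._send = send              # returns True when the endpoint accepted the event
        self._max_events = max_events
        self._buffer: Deque[str] = deque()
        self.dropped = 0               # events discarded because the buffer was full

    def submit(self, event: str) -> None:
        """Queue an event, then try to deliver everything that is pending."""
        if len(self._buffer) >= self._max_events:
            self._buffer.popleft()     # buffer full: the oldest event is lost
            self.dropped += 1
        self._buffer.append(event)
        self.flush()

    def flush(self) -> None:
        """Send pending events in order and stop at the first failure."""
        while self._buffer:
            if not self._send(self._buffer[0]):
                return
            self._buffer.popleft()

    @property
    def pending(self) -> int:
        return len(self._buffer)


class FakeEndpoint:
    """Stand-in for the HF so the sketch runs offline (hypothetical)."""

    def __init__(self) -> None:
        self.up = False
        self.delivered: List[str] = []

    def send(self, event: str) -> bool:
        if self.up:
            self.delivered.append(event)
        return self.up


endpoint = FakeEndpoint()
sender = BufferedSender(endpoint.send, max_events=3)
for i in range(1, 6):                  # five events arrive while the HF is down
    sender.submit(f"event-{i}")
endpoint.up = True                     # the HF comes back
sender.flush()
print(endpoint.delivered, "dropped:", sender.dropped)
```

The buffer size is the whole story: anything that does not fit while the HF is down is gone, which is why a 12-hour outage is hard to survive with an in-memory buffer.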

If the HF is itself collecting the data, then I think it depends on where it is collecting from. Meaning, if the source has rolled past the HF's last checkpoint, there would be data loss. If not, it should be able to resume from where it left off.
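The checkpoint reasoning can be sketched like this. It is a simplified model, assuming the source exposes records with increasing numeric IDs and only retains a window of them; `SourceState` and `resume_plan` are hypothetical names and not part of any Splunk API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SourceState:
    oldest_available: int   # lowest record ID the source still retains
    newest_available: int   # highest record ID written so far


def resume_plan(checkpoint: int, source: SourceState) -> Tuple[int, int]:
    """Return (first record ID to read, number of records lost).

    `checkpoint` is the ID of the last record collected before the crash.
    """
    next_wanted = checkpoint + 1
    if next_wanted >= source.oldest_available:
        return next_wanted, 0              # nothing rolled off: resume where we left off
    lost = source.oldest_available - next_wanted
    return source.oldest_available, lost   # gap between checkpoint and retained data


# Source still holds everything after the checkpoint: no loss.
print(resume_plan(100, SourceState(oldest_available=50, newest_available=500)))
# Source rolled past the checkpoint during the outage: records 101..249 are gone.
print(resume_plan(100, SourceState(oldest_available=250, newest_available=500)))
```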

I am curious what that recommendation from Splunk is, to have only a single HF as an alternate "h/a" strategy.

SplunkExplorer · ‎09-22-2022

Hi @maciep, thanks for your answer.

In a nutshell, you confirmed what I was afraid of; I opened the post to get confirmation of this.

I'm aware of the UF "buffer", but unfortunately we don't have this feature: the data is generated by the data sources and sent to the HF server with a native protocol, and the HF then sends it to the Cloud environment. There is no UF forwarding the data to the HF; the HF sitting between the data sources and the Splunk environment is the only forwarder/collector in the environment.

Thanks for your help.

isoutamo · ‎09-23-2022

Hi

Out of curiosity, what is the reason you cannot install several HFs and add all of them to your UFs' outputs.conf? There is no Splunk-related reason not to have several HFs (in parallel) between the UFs and Splunk Cloud. Actually, if you must use HFs, the best practice is to use several (in parallel) between the UFs and Splunk Cloud.

If you are not using any Splunk TAs that need e.g. Python, then the best practice is to use several intermediate/gateway UFs between the source UFs and SC.

The only situation where there should be only one is when it runs modular inputs, like DB Connect, that read events from other systems and then send them to SC. But even in this case you can have several HFs; just make sure the modular inputs are installed (or at least in use) on only one HF at a time. You then need some mechanism to replicate the checkpoint files/statuses between your HFs if you want to avoid rereading the modular inputs' data after a recovery.

r. Ismo

SplunkExplorer · ‎09-23-2022

Hi @isoutamo, I think I was not clear/complete enough in my explanation.

We have no UFs in our environment; the data is sent to the HF with a native protocol. So, an example of the data flow is:

Data source sends data via syslog, without using a UF or other Splunk components -> HF collects the data and forwards it -> Data is sent to the cloud.

I have no details about the reason for not having multiple HFs; I only know that this approach has been discouraged, and I have asked for more details and further checks on this point.

I appreciate the further details you shared; I learned about some things I didn't know before. Thanks a lot!

isoutamo · ‎09-23-2022

Thanks for the clarifications.

In production I never (if I can avoid it) use Splunk to collect syslog traffic. Because of the protocol (especially with UDP transport) it will always drop some events, and you don't know when or how many. The preferred way is to use a real syslog server (HA or single). One option is to use Splunk Connect for Syslog (https://splunk.github.io/splunk-connect-for-syslog/main/).

If you have no option other than Splunk to collect the syslog feed, then it's possible to set up HFs behind a load balancer and have the syslog clients send their traffic to a VIP on the LB, which forwards it to the backends (the HFs). That way (when the LB is configured properly) you can reduce the event loss when you are e.g. rebooting HFs. With an LB you can also run Splunk as a normal user instead of root, and clients can still send data to port 514. If you are using just syslog with Splunk then maybe (if you don't need to do any parsing etc.) it's better to use UF instead of HF to re

Anyhow, the best practice is to use a real syslog server to receive events from syslog feeds.
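For the load-balancer option, the idea of skipping a forwarder while it is rebooting can be sketched as a round-robin pick over healthy backends. This is only a toy model in Python of what an LB does, not a replacement for a real one; `RoundRobinBalancer` and the hostnames are hypothetical.

```python
from typing import Dict, List, Optional


class RoundRobinBalancer:
    """Pick backends in turn, skipping those currently marked unhealthy."""

    def __init__(self, backends: List[str]) -> None:
        self._backends = list(backends)
        self._healthy: Dict[str, bool] = {b: True for b in self._backends}
        self._next = 0

    def mark(self, backend: str, healthy: bool) -> None:
        """Record the result of a health check for one backend."""
        self._healthy[backend] = healthy

    def pick(self) -> Optional[str]:
        """Return the next healthy backend, or None if all are down."""
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if self._healthy[candidate]:
                return candidate
        return None


# Hypothetical forwarder hostnames behind the VIP.
lb = RoundRobinBalancer(["hf1.example.internal", "hf2.example.internal"])
print(lb.pick(), lb.pick())              # traffic alternates between both HFs
lb.mark("hf1.example.internal", False)   # hf1 is rebooting
print(lb.pick(), lb.pick())              # everything goes to hf2 meanwhile
```

With a single backend there is nothing to fail over to, which is the reason for running at least two forwarders behind the VIP.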

SplunkExplorer · ‎09-23-2022

Thanks @isoutamo, much appreciated. I will make a note of this.

Can I ask you about a different data source? What if the source logs are not sent via syslog but are Windows logs sent with native protocols?

So, if the flow is:

Windows logs sent with native protocols, without using a UF -> 1 HF receives the logs and forwards them -> Data is sent to the Cloud.

isoutamo · ‎09-23-2022

The best practice is to send those directly to SC without any gateway/intermediate forwarders.

If there is e.g. a security policy that prevents direct sending (FW openings), then you must set up a couple of gateway forwarders to send the events to SC. The minimum is 2 per security zone (or whatever unit you are sending logs from). UFs are preferred; use HFs only if you need parsing etc.

If I understand correctly, "Windows native protocol" <=> Windows event log etc.? Then you must have at least one HF on a Windows machine joined to the same domain and use e.g. the Splunk Windows TA there, with a correctly configured inputs.conf, to get the events. If I have understood correctly, there are some other tools you could use on a Windows client to convert the event log to e.g. syslog, but I prefer to use a UF instead of those on Windows clients.

SplunkExplorer · ‎09-23-2022

Thanks a lot. Your help is very useful to us.
