Getting Data In

HAProxy server client IP

Nrsch
Explorer

I have a serious problem and would appreciate your help. We have an HAProxy server that receives logs from various clients and forwards them to a Splunk Heavy Forwarder. The problem is that HAProxy replaces each client's IP address with its own (since it terminates the TCP connection). How can we get the original client's IP address for each log on the Splunk Heavy Forwarder?


PickleRick
SplunkTrust

If you mean the value of the host field assigned to an event when connection_host=ip (and only if it's not overwritten later by transforms), then no, you cannot do that directly within Splunk.

HAProxy works as a middle-man: it is the originator of all your logging TCP connections. It receives events from the remote hosts and then sends them all to your HF within a connection initiated by itself, so the origin of each event is lost.

This is one of the reasons why you should _not_ receive syslogs directly on Splunk.

Ideally, you should replace your HAProxy with a syslog receiver that tracks the source addresses and either writes events to files to be picked up by a forwarder or forwards them to HEC.
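As a minimal sketch of the file-based variant, assuming rsyslog as the receiver (the port, directory, and file names are placeholders to adapt):

# /etc/rsyslog.d/splunk-intake.conf - illustrative sketch, not a drop-in config.
# Listen for syslog on TCP/514 and write each sender's events to a
# separate file named after the source IP, so the origin survives in the path.
module(load="imtcp")
input(type="imtcp" port="514")

template(name="PerHostFile" type="string"
         string="/var/log/remote/%fromhost-ip%/syslog.log")
action(type="omfile" dynaFile="PerHostFile")

A forwarder monitoring /var/log/remote/ can then recover the origin from the path, e.g. with host_segment = 4 on the monitor input.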

Nrsch
Explorer

Thank you for your answer.

We are using HAProxy as a load balancer because we want to have two Heavy Forwarders, so if one fails, the other remains active.

I did some research and found that HAProxy's PROXY protocol prepends a header containing the client's IP address. However, it seems that the Splunk Heavy Forwarder does not natively support or understand this header.

As you mentioned, does this mean there is no reliable way to use HAProxy as a load balancer and still have access to the original client IP in the Splunk Heavy Forwarder?

Also, I have one more question:
Is it true that the log format of each client (when HAProxy is acting as a middle-man sending logs to the HF) may be different, depending on the client source?

Thank you very much for your help.


PickleRick
SplunkTrust

Ok. Several things.

1. Unless you're working with a syslog-aware solution, load-balancing syslog usually doesn't end well. Having said that, I'm not aware whether modern HAProxy "can syslog" or not; I've just never tried to use it for this purpose.

2. Whatever you do, with such a setup you're always introducing another explicit hop in the network path between the syslog source and your syslog receiver(s). Some solutions (rsyslog for sure, I'm not sure about syslog-ng) can spoof the source address, but that only works for UDP syslog and can lead to network-level problems, especially when the return route doesn't match the supposed datagram origin (a sketch follows at the end of this post). With TCP you simply cannot spoof the source address, because return packets would go to the original source, not to the LB.

3. There is no single "syslog format", so each of your sources can send data in a different form. There are even solutions that send differently formatted events depending on which subsystem the events come from.

4. There is no concept of "headers" in syslog. Proxies can add their own headers, but that usually applies to HTTP.

5. While the idea of having an LB component for HA seems sound, there is one inherent flaw in this reasoning: the LB itself becomes your SPOF. And in your case it adds a host of new problems without really solving the old ones. If you really want a highly available syslog receiving solution, you need something that:

- can understand syslog, process each event independently, buffer events in case of network/receiver problems, and so on

- can be installed in a highly available 1+1 setup

Additionally, you might have problems with things like health checks for downstream receivers if you try to send plain TCP data.

A general-purpose network-level load balancer doesn't meet the first requirement, and a typical open-source syslog server on its own doesn't meet the second (with a lot of fiddling with third-party clustering tools you can get a pair of syslog servers running with a floating IP, but then you're introducing a new layer of maintenance headaches).

So typically with syslog receiving you want a small syslog receiver that is as close to the sources as possible and as robust as possible. You don't want to send straight to HFs or indexers: receiving syslog directly on Splunk has performance limitations and is difficult to maintain.
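For the UDP spoofing mentioned in point 2, a minimal rsyslog sketch could look like the following, assuming the omudpspoof output module is installed (the target address is a placeholder):

# rsyslog sketch of UDP source-address spoofing - illustrative only;
# requires the omudpspoof module, and works for UDP syslog only
module(load="imudp")
input(type="imudp" port="514")

module(load="omudpspoof")
# 203.0.113.10 stands in for the downstream receiver
action(type="omudpspoof" target="203.0.113.10" port="514")

Relayed datagrams keep the original sender's source IP, with the routing caveats described above.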

Nrsch
Explorer

Thank you for your answer.
Since HAProxy collects logs from different clients, their log formats differ.
So, do we need to parse the logs twice — once to extract the hostname, and again to extract other fields depending on the log type?
Also, is it possible to extract the host IP address instead of the hostname?

Thank you very much for your help.


livehybrid
Ultra Champion

Hi @Nrsch 

Are you talking about the "host" field in Splunk? It is typical for this field to identify the device that is sending the logs. Instead, you would want to extract a field called something like "src_ip" or "client_ip", which would be the IP address of the client system making the request.
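For illustration, a search-time extraction along these lines could pull such a field out of the events; the sourcetype name and the client_ip= pattern are assumptions to be adapted to your actual data:

# props.conf - illustrative only; the sourcetype and the key=value
# pattern are assumptions about what your events look like
[your_sourcetype]
EXTRACT-client_ip = client_ip=(?<client_ip>\d{1,3}(?:\.\d{1,3}){3})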

If you're able to share a few sample/redacted events then I'd be happy to help create the relevant extractions you need.

There is also a Splunkbase app for HAProxy (https://splunkbase.splunk.com/app/3135) which is designed to take a syslog input; however, the field extractions could well be the same if you're writing to a file and then forwarding with a Splunk forwarder.

Alternatively, you could set a custom HAProxy log format (since you wouldn't be using the off-the-shelf add-on) and then set key=value pairs for the log event components, e.g. client_ip=%ci for the client IP. See https://www.haproxy.com/blog/haproxy-log-customization for more info on that.
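A sketch of what that could look like in haproxy.cfg (the frontend/backend names are placeholders; note this customises HAProxy's own access log, which does record the connecting client's IP):

# haproxy.cfg - illustrative key=value log format; %ci is the client IP,
# %cp the client port, %f/%b the frontend/backend names, %B bytes read
frontend syslog_in
    bind *:514
    log global
    log-format "time=%t client_ip=%ci client_port=%cp frontend=%f backend=%b bytes=%B"
    default_backend splunk_hf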


Nrsch
Explorer

Thank you for the help. I believe both the Splunkbase app and the log format you referred to relate to HAProxy's internal logs. However, what I'm looking for is a way to capture the IP addresses of the external clients connecting through HAProxy.

In fact, we have an HAProxy server that receives logs from various clients and forwards them to a Splunk Heavy Forwarder. Each client has its own log format. The problem is that HAProxy replaces the client's IP address with its own (in TCP). The question is: how can we get the client's IP address for each log on the Splunk Heavy Forwarder?


gcusello
SplunkTrust

Hi @Nrsch ,

if you have different log formats, you should precisely identify each data source and assign the correct sourcetype to each one.

Then, having the correct sourcetype, you can define a host recognition regex for each one.

About the IP address instead of the hostname: it depends on the presence of this field in the log. If it's present, you can assign it to the host field using a regex; if it isn't present, it isn't so easy.
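One common way to keep the sources apart, sketched here under the assumption that each source type can be received on its own TCP port (the ports and sourcetype names are placeholders):

# inputs.conf on the HF - illustrative; ports and sourcetype names
# are placeholders for your actual sources
[tcp://:5140]
sourcetype = vendor_a:syslog
connection_host = ip

[tcp://:5141]
sourcetype = vendor_b:syslog
connection_host = ip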

Ciao.

Giuseppe

gcusello
SplunkTrust
SplunkTrust

Hi @Nrsch ,

if the hostnames that you want to use in the host field are in the logs, you can override this value using a regex on the Heavy Forwarder, following the instructions at:

https://docs.splunk.com/Documentation/Splunk/9.4.2/Data/Overridedefaulthostassignments

in props.conf:

[<your_sourcetype>]
TRANSFORMS-override_host = override_host

in transforms.conf:

[override_host]
REGEX = <your_regex>
FORMAT = host::$1
DEST_KEY = MetaData:Host

When defining your regex, be aware that the capture group referenced in the FORMAT option ($1) must be the one containing the hostname.
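For example (purely illustrative; the hostname= key is an assumption about your event format):

[override_host]
# assumes events contain a literal hostname=<value> pair - adapt the
# pattern to the real event layout; $1 is the first capture group
REGEX = hostname=(\S+)
FORMAT = host::$1
DEST_KEY = MetaData:Host

The same mechanism works for an IP address, e.g. REGEX = src=(\d{1,3}(?:\.\d{1,3}){3}).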

Ciao.

Giuseppe
