In our organization, our Apache log files are of type access_combined, except that the host field is replaced with the value(s) from the X-Forwarded-For field, because we use load balancers and other caching mechanisms.
This creates a situation where the host field ends up looking like:
xx.xx.xx.xx or
xx.xx.xx.xx, xx.xx.xx.xx or
xx.xx.xx.xx, xx.xx.xx.xx, xx.xx.xx.xx etc
I have seen log entries with as many as 5 host IPs in the X-Forwarded-For field. Can someone explain the process required to have Splunk correctly index the access logs given this variability in the log entries?
I have a solution that works pretty well, at least in our environments. I haven't tested it thoroughly against IPv6 addresses, but the few "fake" ones I am getting appear to come through correctly.
Note, in our environments we have the ClientIP field for the webserver replaced by the X-Forwarded-For IP list if we get that header. So we took the default access-extractions REGEX from $SPLUNK_HOME/etc/system/default/transforms.conf, took out the [[nspaces:clientip]], and replaced it with the following:
(?<all_xff_ip>(([.\d]+|[a-fA-F0-9\:\.]+|-|localhost)(?:,\s)?)+)
Then we have a second transform to break the individual IPs out as needed.
#A multivalue definition to capture all the xff ips
[mv_xff_ip]
SOURCE_KEY=all_xff_ip
REGEX = (?P
MV_ADD = true
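To see the two-stage idea outside Splunk, here is a minimal Python sketch. It assumes Python's re module as a stand-in for Splunk's regex engine, and the names XFF_LIST, SINGLE_IP, and split_xff are made up for illustration. The patterns are simplified versions of the ones above, not the exact transforms.

```python
import re

# Stage 1: capture the whole comma-separated list at the start of the line.
# Simplified from the all_xff_ip pattern above (hypothetical, not Splunk's own).
XFF_LIST = re.compile(
    r"^(?P<all_xff_ip>(?:(?:[.\d]+|[a-fA-F0-9:.]+|-|localhost)(?:,\s)?)+)\s"
)

# Stage 2: pull each address out of the captured list.
# This plays the role of the second transform with MV_ADD = true.
SINGLE_IP = re.compile(r"[a-fA-F0-9:.]+|-|localhost")


def split_xff(line):
    """Return the list of client addresses at the start of an access log line."""
    match = XFF_LIST.match(line)
    if match is None:
        return []
    return SINGLE_IP.findall(match.group("all_xff_ip"))
```

Calling split_xff on a line that starts with "10.0.0.1, 10.0.0.2 - - [" should give back the two addresses as separate list items, which is roughly what a multivalue field looks like at search time.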
This answer pointed me in the right direction but it was missing a piece of the puzzle - changes to props.conf. Here are all the configuration changes that had to be made to turn clientip into a multivalued field and correctly parse the X_Forwarded_For IPs:
transforms.conf - modified. The new part is the all_xff_ip capture group at the start of the regex:
[access-extractions]
REGEX = ^(?<all_xff_ip>(([.\d]+|[a-fA-F0-9\:\.]+|-|localhost)(?:,\s)?)+)\s++[[nspaces:ident]]\s++[[nspaces:user]]\s++[[sbstring:req_time]]\s++[[access-request]]\s++[[nspaces:status]]\s++[[nspaces:bytes]](?:\s++"(?<referer>[[bc_domain:referer_]]?+[^"]*+)"(?:\s++[[qstring:useragent]](?:\s++[[qstring:cookie]])?+)?+)?[[all:other]]
# new section
[clientip]
SOURCE_KEY=all_xff_ip
REGEX = (?P<clientip>[.:\d]+|[a-fA-F0-9\:.]+|-|localhost)
MV_ADD = true
props.conf
# new section
[access_combined]
REPORT-access_combined_clientip = clientip
There is more than likely a much better way to do this... But here is how I wound up solving it...
The actual access log entry will look like
IP1, IP2, IP3 - - [time] "GET url ..." ...
right?
I use the [ as an anchor with multiple regexes:
rex field=_raw "^(?.*)\s+-\s+-\s+["
and then another rex to parse the rest of the line (of course, the two - fields have meaning, and you may want to pull those in as fields as well). You can also do this in props.conf and transforms.conf by having multiple REPORT lines in props.conf, each calling its own transforms.conf stanza.
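The anchor idea can be tried outside Splunk with a short Python sketch. This is only an illustration under assumptions: the group names hosts, ident, and user and the function parse_prefix are hypothetical, and the pattern is my own rather than the rex above. Note that the literal [ has to be escaped as \[ inside a regex.

```python
import re

# Everything before "<ident> <user> [" is treated as the address list.
# ident and user are captured as well, since they are not always "-".
PREFIX = re.compile(r"^(?P<hosts>.*?)\s+(?P<ident>\S+)\s+(?P<user>\S+)\s+\[")


def parse_prefix(line):
    """Split the leading part of an access log line into hosts, ident, user."""
    match = PREFIX.match(line)
    if match is None:
        return None
    # Break the captured list on commas and drop surrounding whitespace.
    hosts = [h.strip() for h in match.group("hosts").split(",") if h.strip()]
    return {
        "hosts": hosts,
        "ident": match.group("ident"),
        "user": match.group("user"),
    }
```

The lazy .*? makes the match stop at the first [ that is preceded by two whitespace-separated tokens, so the number of addresses in front does not matter.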
Use regex to override the field extraction and set the correct value, depending on the number of IPs in the different kind of logs....
Same issue for me with the X-Forwarded-For in the logs.