Splunk Search

Do we have to write a custom transform for our Apache combined access log format for proper field extraction?

Explorer

We have the following log format in our Apache configuration:

LogFormat "%{True-Client-IP}i %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D \"%{x-wily-servlet}o\""

This is logged as:

24.96.82.143 24.143.197.191 - - [30/Mar/2015:13:03:45 -0400] "GET /AST/Main/Belk_Primary/PRD~99999998368WACO/Wacoal+Wacoal+Embrace+Lace+Collection.jsp?navPath=Wacoal&boutiquePage=true&ZZ%3C%3EtP=4294948624&ZZ_OPT=Y&PRODUCT%3C%3Eprd_id=845524442450490&FOLDER%3C%3Efolder_id=2534374302087929&bmUID=kNFqGZg&ViewAll=&changeViewInd=y HTTP/1.1" 200 42823 "http://www.belk.com/AST/Boutiques/Boutiques_Primary/Wacoal.jsp" "Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko/20100101 Firefox/22.0" 168176140 "Clear appServerIp=74.213.129.193&agentName=WSPRD08F&servletName=__belk_outfit_detail&servletResponseTime=168097&agentHost=belkecaprd20&agentProcess=WebLogic"

The default access log extraction in transforms.conf does not match our format. Does this mean we have to write a custom transform for our log? Please confirm.

REGEX = ^[[nspaces:clientip]]\s++[[nspaces:ident]]\s++[[nspaces:user]]\s++[[sbstring:req_time]]\s++[[access-request]]\s++[[nspaces:status]]\s++[[nspaces:bytes]](?:\s++"(?<referer>[[bc_domain:referer_]]?+[^"]*+)"(?:\s++[[qstring:useragent]](?:\s++[[qstring:cookie]])?+)?+)?[[all:other]]

1 Solution

SplunkTrust

Probably yes. If you change the format of an event that Splunk extracts fields from via regex, you should expect to have to write your own regex. Because face it, you're not really using "Apache Combined Log Format" anymore, you're using "aruncse83 Apache-Combined-Like Log Format".

One approach I have used before with good success is to add things only to the END of the Apache Combined format, and to add them strictly as key=value items. That way the standard regex still matches the standard Apache portion of the event, and the key=value data is extracted just fine by Splunk's default KV-extract code. You get the benefits of your custom format without any of the associated pain.
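The idea can be sketched outside Splunk in a few lines of Python. This is only an illustration: the pattern below is a simplified stand-in for the combined format, not Splunk's own regex, and the appended keys `true_client_ip` and `response_time_us` are hypothetical names chosen for the example.

```python
import re

# Simplified stand-in for the standard combined format (NOT Splunk's regex).
COMBINED = re.compile(
    r'^(?P<clientip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<req_time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<useragent>[^"]*)"'
    r'(?P<other>.*)$'
)

# Generic key=value matcher for whatever was appended after the standard part.
KV = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse(line):
    """Parse the standard prefix, then pick up trailing key=value pairs."""
    m = COMBINED.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    other = fields.pop("other")
    for key, value in KV.findall(other):
        fields[key] = value.strip('"')
    return fields


# Hypothetical event: standard combined fields first, custom items appended
# at the end as key=value (key names are assumptions for this example).
line = (
    '24.143.197.191 - - [30/Mar/2015:13:03:45 -0400] '
    '"GET /index.jsp HTTP/1.1" 200 42823 '
    '"http://www.belk.com/" "Mozilla/5.0" '
    'true_client_ip=24.96.82.143 response_time_us=168176140'
)
fields = parse(line)
```

Because the custom items sit after the standard fields, the prefix pattern never needs to change when another key is appended.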

Splunk Employee

Let's face it, this is a great answer.

Explorer

Thank you, dwaddle, for the reply. This is exactly what I did just after posting the question: I changed the regex to match the additional field logged by our Apache, by adding an extra [[nspaces:clientip]]\s++ at the start of the default regex:

REGEX = ^[[nspaces:clientip]]\s++[[nspaces:clientip]]\s++[[nspaces:ident]]\s++[[nspaces:user]]\s++[[sbstring:req_time]]\s++[[access-request]]\s++[[nspaces:status]]\s++[[nspaces:bytes]](?:\s++"(?<referer>[[bc_domain:referer_]]?+[^"]*+)"(?:\s++[[qstring:useragent]](?:\s++[[qstring:cookie]])?+)?+)?[[all:other]]

That fixed the problem with the default extraction. I agree with your recommendation to move all the custom fields to the end in key=value format (that is the standard practice), and we should probably do this some time later. For now it is easier for me to make the change on the Splunk side and extract these fields than to adjust the web server, which requires additional paperwork.

SplunkTrust

Glad it worked for you. One thing I would check, though: you're extracting into a field named clientip twice in this case. Do you mean to do that? If you do, that's probably fine, and I would expect clientip to become multivalued. Contextually it's kinda weird to me, but you know your data best.
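As a rough illustration of the alternative, here is a Python sketch that gives each leading address its own name instead of reusing clientip. It is not a Splunk transform: the pattern is a simplified approximation of the poster's LogFormat, the group name `true_client_ip` is an assumption, and the sample line is a shortened version of the event in the question.

```python
import re

# Approximation of the custom LogFormat with one extra leading field.
CUSTOM = re.compile(
    r'^(?P<true_client_ip>\S+)\s+'          # %{True-Client-IP}i (name assumed)
    r'(?P<clientip>\S+)\s+'                 # %h
    r'(?P<ident>\S+)\s+(?P<user>\S+)\s+'    # %l %u
    r'\[(?P<req_time>[^\]]+)\]\s+'          # %t
    r'"(?P<request>[^"]*)"\s+'              # "%r"
    r'(?P<status>\d{3})\s+(?P<bytes>\S+)\s+'
    r'"(?P<referer>[^"]*)"\s+"(?P<useragent>[^"]*)"'
    r'(?P<other>.*)$'                       # %D and the x-wily-servlet value
)

# Shortened version of the event from the question (request path trimmed).
line = (
    '24.96.82.143 24.143.197.191 - - [30/Mar/2015:13:03:45 -0400] '
    '"GET /AST/Boutiques/Boutiques_Primary/Wacoal.jsp HTTP/1.1" 200 42823 '
    '"http://www.belk.com/AST/Boutiques/Boutiques_Primary/Wacoal.jsp" '
    '"Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko" '
    '168176140 "Clear appServerIp=74.213.129.193&agentName=WSPRD08F"'
)

match = CUSTOM.match(line)
fields = match.groupdict() if match else {}
print(fields.get("true_client_ip"), fields.get("clientip"), fields.get("status"))
```

With separate names, each address stays a single-valued field, so searches do not have to guess which value of a multivalued clientip is which.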

Explorer

It is actually the remote IP.
