Getting Data In

Parse apache access log in transforms.conf

shangshin
Builder

I noticed there are two default sourcetypes for apache logs. However, we are using a different format in our apache web server (see the LogFormat below). I assume I need to use a regular expression in transforms.conf. Is that correct? If yes, where can I see the default sample so I can create the right kv fields in transforms.conf? Thanks!

%t %h \"%{Proxy-Remote-User}i\" \"%{User-Agent}i\" %m %H \"%U\" \"%q\" %>s %b %T

0 Karma
1 Solution

lguinn2
Legend

I am not sure what you mean by "the default sample." So here is an example of the configuration files that define a customized sourcetype. You can do this for any sort of input that is in a format that Splunk does not already recognize. I tried to do it for your actual custom log, but I am sure I didn't get it exactly right.

  • First, you should create the props.conf and transforms.conf files, before setting up your input on the production server. It is best to do this on a test server (you can install Splunk on your PC and use that as a test server), where you can make sure it is working the way you want. Upload a sample of your apache data, too.
  • I called the new sourcetype apache_custom in my example. In props.conf, I tell Splunk to do two things: (1) override the host before indexing the data and (2) set up the fields, based on your description above, for use when searching and reporting on the data.
  • transforms.conf contains the specification for how to do the override and field extraction
  • Once you are happy with the props.conf and transforms.conf, set up the inputs.conf to actually bring your data into Splunk

I show the host override below, but you may not need it. If you don't, delete it and things will be more efficient. But - if your Apache log will contain information from a variety of web hosts, you must have the override to make sure that Splunk assigns the proper host name to each event in the data.

The apache_custom_fields stanza in transforms.conf is where the field extraction is actually set up. The fields are defined by a regular expression. I hope I got it right, but I might not have, depending on your actual data. I suggest that you take the regular expression below and put it in a regular expression testing tool. (Note that the regular expression may be line-wrapped below - there is not actually a newline in the regular expression.) Add a sample of your log file and see if the results make sense. You can try http://gskinner.com/RegExr/ but there are others.

Look in the manuals at the following locations for more details:
http://docs.splunk.com/Documentation/Splunk/latest/Data/Whysourcetypesmatter
http://docs.splunk.com/Documentation/Splunk/latest/Data/Createsourcetypes#Edit_props.conf
http://docs.splunk.com/Documentation/Splunk/latest/Knowledge/Addfieldsatsearchtime

inputs.conf

[monitor://pathtoyourlogfiles]
sourcetype=apache_custom

props.conf

[apache_custom]
TRANSFORMS-h1=hostoverride
REPORT-r1=apache_custom_fields

transforms.conf

[apache_custom_fields]
REGEX=] (\S+) "([^"]*)" "([^"]*)" (OPTIONS|GET|HEAD|POST|PUT|DELETE|TRACE|CONNECT) (HTTP\S+) "([^"]*)" "([^"]*)" (\d{3}) (\d+|-) (\S+)
FORMAT=clientip::$1 user::$2 useragent::$3 method::$4 protocol::$5 url::$6 uri_query::$7 status::$8 bytes::$9 timetaken::$10

[hostoverride]
DEST_KEY = MetaData:Host
REGEX = ] (\S+)
FORMAT = host::$1
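As a quick sanity check outside Splunk, you can exercise the extraction with an ordinary PCRE-style regex library. Below is a minimal Python sketch; the pattern is my adjustment to the LogFormat in the question (host matched as \S+ so dotted IPs work, both quoted strings captured) and the field names mirror the FORMAT line, so treat it as an assumption to verify against your real data:

```python
import re

# Sketch of the apache_custom_fields extraction, adjusted to the LogFormat
# %t %h "%{Proxy-Remote-User}i" "%{User-Agent}i" %m %H "%U" "%q" %>s %b %T
# -- verify against your own log lines before using it in transforms.conf.
PATTERN = re.compile(
    r'\] (\S+) "([^"]*)" "([^"]*)" '
    r'(OPTIONS|GET|HEAD|POST|PUT|DELETE|TRACE|CONNECT) (HTTP\S+) '
    r'"([^"]*)" "([^"]*)" (\d{3}) (\d+|-) (\S+)'
)

# Field names in the same order as the capture groups in the FORMAT line
FIELDS = ["clientip", "user", "useragent", "method", "protocol",
          "url", "uri_query", "status", "bytes", "timetaken"]

# Sample event from this thread
sample = ('[21/May/2012:11:50:16 -0400] 10.39.208.3 "my-user-id" '
          '"libwww-perl/5.77" GET HTTP/1.1 "http://www.amazon.com" '
          '"?search-alias%3Daps&field-keywords=ipad+3&sprefix=ipad%2Caps%2C210" '
          '200 495 0')

m = PATTERN.search(sample)
fields = dict(zip(FIELDS, m.groups()))
print(fields["clientip"], fields["method"], fields["status"])  # 10.39.208.3 GET 200
```

If the match fails or a group comes back empty on your data, adjust the pattern first and only then paste it into transforms.conf.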

lguinn2
Legend

I think if you look earlier in transforms.conf, you will see these expressions. They aren't documented that I can find, and they aren't any official flavor of regex that I know. But, they are sort of "character classes" that Splunk uses as a shorthand for the sourcetypes that are predefined within Splunk.

Funny you should ask, I just got an answer to this very question a few days ago. 🙂

But that's why I wrote out the regexes in my original example - I couldn't really tell you how to use this syntax correctly.

0 Karma

shangshin
Builder

Cool. Thanks!
One last question -- where can I find the definition of the regex strings, e.g. nspaces, sbstring, etc.?

REGEX = ^[[nspaces:clientip]]\s++[[nspaces:ident]]\s++[[nspaces:user]]\s++[[sbstring:req_time]]\s++[[access-request]]\s++[[nspaces:status]]\s++[[nspaces:bytes]]?[[all:other]]

0 Karma

lguinn2
Legend

You can find the default settings for access_combined and other sourcetypes in $SPLUNK_HOME/etc/system/default.
Look specifically at props.conf and transforms.conf.
You will find the regular expressions in transforms.conf.
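If you have shell access to the Splunk server, btool will print the merged settings without opening the files by hand. (A sketch; the stanza name access-extractions is what the default transforms.conf uses for the access-log extraction, but names can differ by version.)

```
# Show the shipped access_combined sourcetype definition
$SPLUNK_HOME/bin/splunk btool props list access_combined

# Show the regex behind the default access-log field extraction
$SPLUNK_HOME/bin/splunk btool transforms list access-extractions
```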

However, you should not make your changes in the default directory.

0 Karma

shangshin
Builder

Thank you! I probably didn't ask my question properly. Let me try to rephrase it. When I installed Splunk, I could see two sourcetypes for common apache log files -- access_common and access_combined_cookie.
Since my apache log format is customized, I have to create the regular expression myself.
This part is time consuming, and it would be great if I could reuse the old transforms.conf.
My sample event data:

[21/May/2012:11:50:16 -0400] 10.39.208.3 "my-user-id" "libwww-perl/5.77" GET HTTP/1.1 "http://www.amazon.com" "?search-alias%3Daps&field-keywords=ipad+3&sprefix=ipad%2Caps%2C210" 200 495 0

0 Karma