Getting Data In

Parse apache access log in transforms.conf

shangshin
Builder

I noticed there are two default sourcetypes for Apache logs. However, we are using a different format on our Apache web server (see the LogFormat below). I assume I need to use a regular expression in transforms.conf. Is that correct? If so, where can I see the default sample so that I can create the right key-value fields in transforms.conf? Thanks!

%t %h \"%{Proxy-Remote-User}i\" \"%{User-Agent}i\" %m %H \"%U\" \"%q\" %>s %b %T

1 Solution

lguinn2
Legend

I am not sure what you mean by "the default sample." So here is an example of the configuration files that define a customized sourcetype. You can do this for any sort of input that is in a format that Splunk does not already recognize. I tried to do it for your actual custom log, but I am sure I didn't get it exactly right.

  • First, you should create the props.conf and transforms.conf files, before setting up your input on the production server. It is best to do this on a test server (you can install Splunk on your PC and use that as a test server), where you can make sure it is working the way you want. Upload a sample of your apache data, too.
  • I called the new sourcetype apache_custom in my example. In props.conf, I tell Splunk to do two things: (1) override the host before indexing the data and (2) set up the fields, based on your description above, for use when searching and reporting on the data.
  • transforms.conf contains the specification for how to do the host override and the field extraction.
  • Once you are happy with the props.conf and transforms.conf, set up the inputs.conf to actually bring your data into Splunk

I show the host override below, but you may not need it. If you don't, delete it and things will be more efficient. But - if your Apache log will contain information from a variety of web hosts, you must have the override to make sure that Splunk assigns the proper host name to each event in the data.

The apache_custom_fields stanza in transforms.conf is where the field extraction is actually set up. The fields are defined by a regular expression. I hope I got it right, but I might not have, depending on your actual data. I suggest that you take the regular expression below and put it in a regular expression testing tool. (Note that the regular expression may appear line-wrapped below - there is not actually a newline in the regular expression.) Add a sample of your log file and see if the results make sense. You can try http://gskinner.com/RegExr/ but there are others.

Look in the manuals at the following locations for more details:
http://docs.splunk.com/Documentation/Splunk/latest/Data/Whysourcetypesmatter
http://docs.splunk.com/Documentation/Splunk/latest/Data/Createsourcetypes#Edit_props.conf
http://docs.splunk.com/Documentation/Splunk/latest/Knowledge/Addfieldsatsearchtime

inputs.conf

[monitor://pathtoyourlogfiles]
sourcetype=apache_custom

props.conf

[apache_custom]
TRANSFORMS-h1=hostoverride
REPORT-r1=apache_custom_fields

transforms.conf

[apache_custom_fields]
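# Field order follows the LogFormat in the question; the Proxy-Remote-User value is matched but not extracted.
REGEX=\] (\S+) "[^"]*" "(.*?)" (OPTIONS|GET|HEAD|POST|PUT|DELETE|TRACE|CONNECT) (HTTP.\S+) "(.*?)" "(.*?)" (\d{3}) (\d+|-) (\d+)
FORMAT=clientip::$1 useragent::$2 method::$3 protocol::$4 url::$5 uri_query::$6 status::$7 bytes::$8 timetaken::$9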

[hostoverride]
# Assumes a web host name immediately follows the closing bracket of the timestamp.
DEST_KEY = MetaData:Host
REGEX = ] (\w+)
FORMAT = host::$1
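
If a regex testing site is not handy, the same check can be done with a short script. The sketch below is only an illustration: the pattern is a candidate written against the LogFormat in the question and the sample event posted in this thread, so it is not necessarily identical to the REGEX in transforms.conf above. The names PATTERN, FIELDS, SAMPLE and extract are made up for this example. Splunk uses PCRE, but for the constructs used here Python's re module behaves the same way, as far as I know.

```python
import re

# Candidate pattern for the LogFormat in the question (an assumption, not Splunk's own):
# %t %h "%{Proxy-Remote-User}i" "%{User-Agent}i" %m %H "%U" "%q" %>s %b %T
PATTERN = re.compile(
    r'\] (\S+) "[^"]*" "(.*?)" '
    r'(OPTIONS|GET|HEAD|POST|PUT|DELETE|TRACE|CONNECT) (HTTP.\S+) '
    r'"(.*?)" "(.*?)" (\d{3}) (\d+|-) (\d+)'
)

# Field names in capture-group order, mirroring the FORMAT line.
FIELDS = ["clientip", "useragent", "method", "protocol", "url",
          "uri_query", "status", "bytes", "timetaken"]

# The sample event posted in this thread.
SAMPLE = ('[21/May/2012:11:50:16 -0400] 10.39.208.3 "my-user-id" '
          '"libwww-perl/5.77" GET HTTP/1.1 "http://www.amazon.com" '
          '"?search-alias%3Daps&field-keywords=ipad+3&sprefix=ipad%2Caps%2C210" '
          '200 495 0')


def extract(line):
    """Return a dict of field name -> value, or None if the line does not match."""
    match = PATTERN.search(line)
    return dict(zip(FIELDS, match.groups())) if match else None


if __name__ == "__main__":
    # Print each extracted field so the result can be compared with the raw event.
    for name, value in (extract(SAMPLE) or {}).items():
        print(f"{name} = {value}")
```

If extract returns None for one of your real log lines, the pattern needs adjusting before it goes into transforms.conf.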

lguinn2
Legend

I think if you look earlier in transforms.conf, you will see these expressions. They aren't documented that I can find, and they aren't any official flavor of regex that I know. But, they are sort of "character classes" that Splunk uses as a shorthand for the sourcetypes that are predefined within Splunk.

Funny you should ask, I just got an answer to this very question a few days ago. 🙂

But that's why I wrote out the regexes in my original example - I couldn't really tell you how to use this syntax correctly.
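
One way to track those shorthand names down is to scan the default transforms.conf for stanzas with the same names. This is only a sketch under an assumption: that names such as nspaces and sbstring appear as stanza headers in that file, each with its own REGEX line. The helper find_stanza_regexes is hypothetical, and the /opt/splunk fallback path is just a guess at a common install location.

```python
import os
import re


def find_stanza_regexes(conf_text, names):
    """Map each requested stanza name to the REGEX lines found under it."""
    wanted = set(names)
    found = {}
    current = None  # name of the wanted stanza we are inside, if any
    for raw in conf_text.splitlines():
        line = raw.strip()
        # A stanza header looks like [name]; double brackets are not headers.
        header = re.fullmatch(r"\[([^\[\]]+)\]", line)
        if header:
            current = header.group(1) if header.group(1) in wanted else None
            if current:
                found.setdefault(current, [])
        elif current and re.match(r"REGEX\s*=", line):
            found[current].append(line)
    return found


if __name__ == "__main__":
    # Assumed location; adjust to your own installation.
    splunk_home = os.environ.get("SPLUNK_HOME", "/opt/splunk")
    path = os.path.join(splunk_home, "etc", "system", "default", "transforms.conf")
    if os.path.exists(path):
        with open(path, encoding="utf-8", errors="replace") as handle:
            hits = find_stanza_regexes(handle.read(), ["nspaces", "sbstring", "all"])
        for name, lines in hits.items():
            print(name, lines)
```

This only reads the file, so it is safe to run against the default directory.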

shangshin
Builder

Cool. Thanks!
One last question: where can I find the definitions of the regex strings, e.g. nspaces, sbstring, etc.?

REGEX = ^[[nspaces:clientip]]\s++[[nspaces:ident]]\s++[[nspaces:user]]\s++[[sbstring:req_time]]\s++[[access-request]]\s++[[nspaces:status]]\s++[nspaces:bytes]?[[all:other]]

lguinn2
Legend

You can find the default settings for access_combined and other sourcetypes in $SPLUNK_HOME/etc/system/default.
Look specifically at props.conf and transforms.conf.
You will find the regular expressions in transforms.conf.

However, you should not make your changes in the default directory. Put your own settings in a local directory instead, such as $SPLUNK_HOME/etc/system/local.

shangshin
Builder

Thank you! I probably didn't ask my question properly, so let me try to rephrase it. After installing Splunk, I can see two sourcetypes for common Apache log files: access_common and access_combined_cookie.
Since my Apache log format is customized, I have to create the regular expression myself.
This part is time-consuming, and it would be great if I could reuse an existing transforms.conf.
My sample event data:

[21/May/2012:11:50:16 -0400] 10.39.208.3 "my-user-id" "libwww-perl/5.77" GET HTTP/1.1 "http://www.amazon.com" "?search-alias%3Daps&field-keywords=ipad+3&sprefix=ipad%2Caps%2C210" 200 495 0
