Advanced documentation for field extraction/transformation?

Engager

I'm trying to make sense of the default access-extractions transform so that I can modify it a bit. I've been nosing around Splunk Answers and the online Admin Manual, in particular the "Use the Field transformations page in Manager" page.

Abbreviated version of the default access-extractions regex:

^[[nspaces:clientip]]\s++[[nspaces:ident]]\s++[[nspaces:user]]\s++

I see that nspaces is another transform, though I'm not sure what the :clientip means, for example. Basically I want to prepend some fields to the expression.

My syslog-ng log output is the same as the common Apache access log, but with a few more fields at the start of each log line. When I simply clone the access-extractions transform and change nothing except the Name field, it kicks back "Please enter all required fields", indicating in red that Event Format is the required field. When I look at the default access-extractions transform (or any of the others), the Event Format field is empty, so it doesn't give me much to go on. Would those names (:clientip, :ident, :user, etc.) be an indication that I need to do something like clientip::$1 ident::$2 user::$3 etc...?

Thanks in advance!

-- Andy

Super Champion

You mentioned the message "Please enter all required fields". Are you trying to edit these regexes from the UI? If so, I'm guessing that these modular regular expressions will fail because the UI doesn't understand them. With regexes this complicated, I would edit the transforms.conf file directly.

Super Champion

I've built upon these before. Like gkanapathy said, there are some docs in the system/default/transforms.conf file itself, and yeah, it's rather ugly.

That said, with a little patience, it's not too bad to figure out what's going on.

Basically these take the form of [[<transform_stanza_name>:<field_name>]]. So in the example you've asked about, [[nspaces:clientip]] means use the nspaces transformer (which simply means "no spaces", pretty simple) and extract the match as a field named clientip. (You may also notice that some of these take field names, and others have the field names built into the transformers themselves.)
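
To make the [[stanza:field]] idea concrete, here is a rough Python sketch of how such references could be expanded into an ordinary regex. This is only an illustration, not Splunk's actual implementation: the STANZAS table and the expand() helper are hypothetical, the nspaces definition (\S+) is an assumption rather than a copy of the shipped stanza, and \s++ is written as \s+ so the result also runs on Python versions older than 3.11.

```python
import re

# Hypothetical stand-ins for transforms.conf stanzas. The real definitions
# live in etc/system/default/transforms.conf and may differ.
STANZAS = {
    "nspaces": r"\S+",  # assumed: one or more non-space characters
}

# Matches [[stanza]] or [[stanza:field]].
MODULAR = re.compile(r"\[\[([\w-]+)(?::(\w+))?\]\]")

def expand(pattern, stanzas=STANZAS):
    """Replace [[stanza:field]] with a named group and [[stanza]] with a plain group."""
    def repl(m):
        name, field = m.group(1), m.group(2)
        # A stanza may itself reference other stanzas, so expand recursively.
        body = expand(stanzas[name], stanzas)
        if field:
            return "(?P<%s>%s)" % (field, body)
        return "(?:%s)" % body
    return MODULAR.sub(repl, pattern)

expanded = expand(r"^[[nspaces:clientip]]\s+[[nspaces:ident]]\s+[[nspaces:user]]\s+")
match = re.match(expanded, "10.0.0.1 - andy ")
print(expanded)
print(match.groupdict())
```

The point is just that the part after the colon becomes the name of the capture group wrapped around whatever the referenced stanza matches.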

Also, the "\s++" seemed really weird to me at first. But as it turns out, this is just normal PCRE-supported regular expression syntax (though not all regex engines support it). The extra + makes the quantifier possessive, which simply means that no backtracking can be done into it after it matches. (It's closely related to atomic grouping.) For most purposes, think of this as a slightly faster "\s+", but I don't recommend that you start using it yourself unless you read up on it. (I've gotten bitten by this a few times.)
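
Here is a small Python sketch of the kind of surprise that no-backtracking behaviour can cause. Python's re module only accepts \s++ directly from version 3.11 on, so the possessive behaviour is emulated here with a lookahead plus backreference, which has the same effect. The sample string is made up for illustration.

```python
import re

text = "a  b"  # 'a', two spaces, 'b'

# Greedy \s+ first grabs both spaces, then backtracks and gives one back
# so that the following \s has something to match.
greedy = re.search(r"a\s+\sb", text)

# Possessive behaviour (like \s++), emulated as (?=(\s+))\1: the lookahead
# captures both spaces and cannot be backtracked into, so the following \s
# finds nothing left to match and the whole pattern fails.
possessive = re.search(r"a(?=(\s+))\1\sb", text)

print(greedy is not None)  # the greedy version matches
print(possessive is None)  # the possessive version does not
```

In the access-extractions regex this is harmless, because each \s++ is followed by something that cannot start with whitespace.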


The other thing I struggled with was that the default access-extractions contained so many helpful field extractions already that I really didn't want to try to rewrite all of it into my own "regular style" regex. Writing a regex that complicated by hand can be pretty daunting.

Take a look at the bc_uri transformer. Who wants to recreate that beast?

REGEX = (?<uri>[[bc_domain:uri_]]?+(?<uri_path>[[uri_root]]?[[uri_seg]]*(?<file>[^\s\?/]+)?)(?:\?(?<uri_query>[^\s]*))?)

Would become:

REGEX = (?<uri>(?<domain>\w++://[^/\s"]++)?+(?<uri_path>/++(?<root>(?:\\"|[^\s\?/"])++)/++?(?:\\"|[^\s\?/"])*+/++*(?<file>[^\s\?/]+)?)(?:\?(?<uri_query>[^\s]*))?)

Keep in mind that bc_uri is just one portion of the entire access-request transformer, which is part of the even larger access-extractions transformer. All of which leads me to believe you can make your own by building on top of access-extractions, like so:

[my-custom-access-extractions]
REGEX = [[access-extractions]]\s++[[nspaces:my_trailing_field]]

Doh, just realized this may not work for you. The access-extractions transformer starts with a ^ (start of line). However, you should still be able to just copy the entire REGEX and stick your extra fields after the ^ and before the [[nspaces:clientip]]. Seems like it's worth a try.
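
As a rough sketch of the "extra fields after the ^" idea, here it is in plain Python regex form rather than modular regex. The extra field names (log_host, log_program) and the sample line are invented for illustration, since the actual modified format hasn't been posted. \S+ stands in for nspaces and \s+ for \s++, which are simplifications.

```python
import re

# Start of the access-extractions pattern, written as plain named groups.
ACCESS_PREFIX = r"(?P<clientip>\S+)\s+(?P<ident>\S+)\s+(?P<user>\S+)\s+"

# Hypothetical extra leading fields; these names are made up.
EXTRA_FIELDS = r"(?P<log_host>\S+)\s+(?P<log_program>\S+)\s+"

# The ^ anchor stays at the very front, followed by the new fields,
# followed by the original pattern.
pattern = re.compile(r"^" + EXTRA_FIELDS + ACCESS_PREFIX)

sample = "web01 apache 10.0.0.1 - andy [rest of line]"  # invented sample line
fields = pattern.match(sample).groupdict()
print(fields["log_host"], fields["clientip"], fields["user"])
```

The same ordering should carry over to transforms.conf: keep the ^, add your own [[nspaces:<field_name>]]\s++ pieces, then the rest of the copied REGEX unchanged.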


If you post some examples of your modified format, I'm guessing that someone will help you out with getting a working regex (modular regex or otherwise)...

Path Finder

Thanks Lowell,

This is a great response and very helpful to my current issue.

Splunk Employee

No. These terms are modular regexes that refer to other regular expressions defined in transforms.conf. The only documentation for them is the etc/system/default/transforms.conf file itself. There is also a tool in the Splunk bin directory, pcregextest, that lets you use and test them. I would recommend that most people avoid using, reading, or modifying these and simply use plain PCRE regex.