Splunk Search

Extracting fields from an existing Field

psheck117
New Member

I am working on some http_referer analysis from my proxy logs, which seems like an interesting thing to do. I want to add a search-time field extraction that splits the http_referer field into its parts, so the data offers more to search on.

Can I do something like:

transforms.conf:
REGEX = field=http_referrer ^(?\w+)://

*Yes, I realize my field name isn't the same as the RFC... haha, official misspelling 😕

I can build the whole thing out with a single line, and I am sure the hardware can handle the overhead without issue (I hope), but I'd rather have a field anchor of some sort to go off of.

Am I missing something on this?

Afterthought: I can do a content match on the :// since nothing else in the logs should contain that combination of ASCII characters; any colons in the URI will be in hex or some other encoding.

Thanks.

0 Karma
1 Solution

martin_mueller
SplunkTrust

I believe you're looking for the SOURCE_KEY setting in transforms.conf, see http://docs.splunk.com/Documentation/Splunk/latest/Admin/transformsconf for details.

As for building a regex to match on "something ending with ://", that will work but not be a pinnacle of efficiency. The automaton working to match the regex will constantly try to start, walk along, and then fail repeatedly - much like running a Splunk search using key=*value. It's much faster to have quick failures by anchoring the start to something.
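As a minimal sketch of what that could look like for a search-time extraction: the stanza names, the sourcetype, and the capture-group name below are made up for illustration, and it assumes http_referrer has already been extracted by the time this transform runs.

```ini
# transforms.conf
# Stanza name and capture-group name are hypothetical placeholders.
[http_referrer_scheme]
# Apply REGEX to the value of the existing field, not to _raw.
SOURCE_KEY = http_referrer
# The ^ anchor lets the match fail fast when the value does not start with a scheme.
REGEX = ^(?<http_referrer_scheme>\w+)://

# props.conf
# Replace the sourcetype name with the one used for the proxy logs.
[my_proxy_sourcetype]
REPORT-http_referrer_scheme = http_referrer_scheme
```

Because the regex only sees the field value, ^ anchors to the start of the referrer itself rather than to the start of the event.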

0 Karma

psheck117
New Member

Yeah, I realized that after I committed my transform... reading RFC 1945 has been enlightening, to say the least. Here is a crack at a proper REGEX for the scheme; I will comment and add the http_referer_uri_extension after testing.

REGEX = (?[a-zA-Z+.-]+)://(?\S[^/]+)((?/.[^?]+))?((?\?.*))?

Ha! Looking at my regex makes me question if I can tighten it a little better too.

0 Karma

martin_mueller
SplunkTrust

A rather theoretical comment on that - if you truly want to capture every imaginable URI scheme, using \w+ isn't going to catch them all. There are more or less obscure schemes with dots and dashes in them.

0 Karma

psheck117
New Member

Here is the full regex for my http_referer extraction. If you do something like this you may be surprised with what shows up as a referrer scheme.

REGEX = (?\w+)://(?\S[^/]+)((?/.[^?]+))?((?\?.*))?
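For illustration, the same pattern might sit in a transform like the sketch below. The stanza name and the four capture-group names are placeholders chosen here, not the names from the original post, and the SOURCE_KEY line plus the leading ^ follow the advice in the accepted answer.

```ini
# transforms.conf
# Stanza and group names are hypothetical; substitute your own.
[http_referrer_parts]
SOURCE_KEY = http_referrer
# scheme :// host, then an optional path and an optional query string
REGEX = ^(?<ref_scheme>\w+)://(?<ref_host>\S[^/]+)((?<ref_path>/.[^?]+))?((?<ref_query>\?.*))?
```

With named groups in a search-time transform, each group that matches becomes a field of the same name, so no FORMAT line is needed.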

I could probably get into the depth of http_referer_uri_extension, but that is hit or miss, and right now I am not sure I need the detail. Though, thinking about it, I could slip it in there.

My first inclination was to break it out into multiple extractions too.

0 Karma

martin_mueller
SplunkTrust

If you know you're only going to encounter http and https, consider using https? as your regex... it'll at least help someone read it later.

0 Karma

psheck117
New Member

Thanks Martin! I will check out & use SOURCE_KEY, I knew I was missing something.

As for my regex, it is definitely not going to end on ://. There is only one place in the event where that sequence will appear: the http:// or https:// in the referrer field, if the field exists at all. I didn't want to put my whole regex into the question, so I left it at the first extracted field.

Thanks again!

0 Karma