I am working on some http_referer analysis from my proxy logs, which seems like an interesting thing to do. I want to add a search-time field extraction that rips apart the http_referer field to provide more search functionality from the data.
Can I do something like:
transforms.conf:
REGEX = field=http_referrer ^(?
*Yes, I realize my field name isn't the same as the RFC... haha, official misspelling 😕
I can build the whole thing out with a single line, and I am sure the hardware can handle the overhead without issue (I hope), but I'd rather have a field anchor of some sort to go off of.
Am I missing something on this?
Afterthought: I can do a content match on the :// as nothing else in the logs should contain that combination of characters in ASCII; any colons in the URI will be in hex or something else.
Thanks.
I believe you're looking for the SOURCE_KEY setting in transforms.conf, see http://docs.splunk.com/Documentation/Splunk/latest/Admin/transformsconf for details.
As for building a regex to match on "something ending with ://", that will work but not be a pinnacle of efficiency. The automaton working to match the regex will constantly try to start, walk along, and then fail repeatedly - much like running a Splunk search using key=*value. It's much faster to have quick failures by anchoring the start to something.
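A minimal sketch of what that could look like, assuming the referrer is already extracted into a field named http_referrer (the asker's spelling) before this transform runs. The stanza name http_referer_parts, the sourcetype placeholder, and the field name http_referer_scheme are made up for illustration, and the REGEX only pulls out the scheme rather than the asker's full pattern.

```
# transforms.conf
[http_referer_parts]
# Run the regex against the http_referrer field instead of _raw
SOURCE_KEY = http_referrer
# Anchored at the start of the field value, so a non-match fails right away
REGEX = ^(?<http_referer_scheme>\w+)://

# props.conf
# (placeholder stanza name, use your own proxy sourcetype)
[your_proxy_sourcetype]
REPORT-http_referer = http_referer_parts
```

With SOURCE_KEY pointing at the field, the ^ anchors to the start of the referrer value rather than the start of the whole event, which gives the quick failures mentioned above.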
Yeah, I realized that after I committed my transform... reading RFC 1945 has been enlightening, to say the least. Here is a crack at a proper REGEX for the scheme. I will comment and add the http_referer_uri_extension after testing.
REGEX = (?
Ha! Looking at my regex makes me question if I can tighten it a little better too.
A rather theoretical comment on that - if you truly want to capture every imaginable URI scheme, using \w+ isn't going to catch them all. There are more or less obscure schemes with dots and dashes in them.
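For illustration: RFC 3986 defines a scheme as a letter followed by any mix of letters, digits, "+", "-", and ".". A sketch of a REGEX along those lines might look like the following, where the field name http_referer_scheme is hypothetical and not taken from the original transform.

```
# Scheme per RFC 3986: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
REGEX = ^(?<http_referer_scheme>[A-Za-z][A-Za-z0-9+.\-]*)://
```

This still anchors at the start of the value, so it keeps the fast-fail behaviour while also matching schemes that \w+ would cut short.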
Here is the full regex for my http_referer extraction. If you do something like this you may be surprised with what shows up as a referrer scheme.
REGEX = (?
I could probably get into the depth of http_referer_uri_extension, but that is hit or miss, and right now I am not sure I need the detail. Though, thinking about it, I could slip it in there.
My first inclination was to break it out into multiple extractions too.
If you know you're only going to encounter http and https, consider using https? as your regex... it'll at least help someone read it later.
Thanks Martin! I will check out & use SOURCE_KEY, I knew I was missing something.
As for my regex, it is definitely not going to end on ://. Though there is only one place in the event where that will exist: the http:// or https:// in the referrer field, if it exists at all. I didn't want to put my whole regex into the question, so I left it at the first extracted field.
Thanks again!