Regex - fqdn, basedomain etc. from URL in CK log

aaronnicoli — Tue, 28 Aug 2012 03:15:21 GMT

Hi there,

I have taken the following regex from here...

http://splunk-base.splunk.com/answers/9736/revisiting-regex-to-extract-domain-name-from-an-fqdn/10407

And modified it to suit domains such as .com.au, leaving it like:

(?<domainname>(?<ip>^[A-Fa-f\d\.:]+$)|(?<nodots>^[^\.]+$)|(?<fqdomain>(?:(?:[^\.]+\.)?(?<tld>((?:[^\.\s]{3})|(?:[^\.\s]{2}))(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+)))))$)

Now, I have data formatted in csv style containing a url string...
To extract the domain/ip string from the data, I use this regex:

(?i)^(?:[^ ]* ){12}.+://(?P<domain>[^:|,|/]+)[/,]?

What I wish to do is create a single regex that populates the domainname, nodots, fqdomain and tld fields from the value captured by the second extraction (the domain field).
Can someone please help me combine the two extractions into a single one?
I'm not the best when it comes to Splunk regex...

Here is some sample data:

Aug 28 13:05:26 111.111.1.1 28-08-2012; 13:04:48, 26, 111.111.111.11, username@hostname, 125679, 1, text/html, http://global.ebsco-content.com/interfacefiles/12.4.33.0.2/javascript/bundled/_layout2/master.js, default, Educational
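
As a quick check outside Splunk, the second extraction can be run against the sample line in Python. This is only a sketch: the names `DOMAIN_RE` and `sample` are made up for illustration, and Python's `re` module stands in for Splunk's PCRE engine, so behaviour inside Splunk may differ in edge cases.

```python
import re

# The "domain" extraction from the post, unchanged apart from being a Python raw string.
DOMAIN_RE = re.compile(r"(?i)^(?:[^ ]* ){12}.+://(?P<domain>[^:|,|/]+)[/,]?")

# The sample event from the post, split across lines only for readability.
sample = ("Aug 28 13:05:26 111.111.1.1 28-08-2012; 13:04:48, 26, 111.111.111.11, "
          "username@hostname, 125679, 1, text/html, "
          "http://global.ebsco-content.com/interfacefiles/12.4.33.0.2/javascript/bundled/_layout2/master.js, "
          "default, Educational")

match = DOMAIN_RE.search(sample)
domain = match.group("domain") if match else None
print(domain)  # global.ebsco-content.com
```

The twelve space-delimited tokens skipped by `(?:[^ ]* ){12}` end just before the URL, so the capture starts right after `://` and stops at the first `/`.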

Re: Regex - fqdn, basedomain etc. from URL in CK log

aaronnicoli — Tue, 28 Aug 2012 05:25:39 GMT

Please help guys, I am really at a loss here 😞

Re: Regex - fqdn, basedomain etc. from URL in CK log

kristian_kolb — Tue, 28 Aug 2012 09:09:53 GMT

The following may not be the exact solution you need, but it may help you along a little bit. It is using the referer_domain field of access_combined logs.

First is the full search that I used, then a slightly more readable version (which may not run as-is because of the added line breaks), then a results table.

sourcetype="access_combined" | head 10000 | rex field=referer_domain "https?://(?:[\w-]*\.)*?(?<coming_from>((?<ip>[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|(?<nodots>[a-zA-Z0-9-_]+$)|(?<the_domain>([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))" | table referer_domain ip nodots the_domain coming_from | dedup referer_domain

Base search

sourcetype="access_combined" | head 10000 |

Start rexing, skip the protocol plus optional hostnames/subdomains

rex field=referer_domain "https?://(?:[\w-]*\.)*?

The field coming_from will get the final result

(?<coming_from>(

find the ip address if any (only ipv4 - add ipv6 if needed)

(?<ip>[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|

or a plain hostname

(?<nodots>[a-zA-Z0-9-_]+$)|

or find the domain (with an optional .co or .com as the second to last part) allowing for up to 5 characters in the top-level (change as needed)

(?<the_domain>([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))" |

output the result for testing

table referer_domain ip nodots the_domain coming_from | dedup referer_domain

Expected result:

referer_domain       ip      nodots   the_domain    coming_from
http://www.blah.com                   blah.com      blah.com
https://1.2.3.4      1.2.3.4                        1.2.3.4
http://my_srv                my_srv                 my_srv
https://a.b.co.uk                     b.co.uk       b.co.uk
http://as.df.jk.edu                   jk.edu        jk.edu

I actually haven't tried the nodots function since I didn't have log data to test it on.
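
For anyone wanting to try the pattern outside Splunk, here is a minimal Python sketch of the same regex run over the URLs in the table above. The names `REFERER_RE` and `classify` are hypothetical, the named groups use Python's `(?P<name>...)` spelling, and the `nodots` character class is reordered to `[a-zA-Z0-9_-]`; Python's `re` is not Splunk's engine, so treat this as a rough check rather than proof of what `rex` will do.

```python
import re

# Kristian's pattern with (?<name>...) rewritten as (?P<name>...) for Python.
# The nodots class is reordered to [a-zA-Z0-9_-] so the hyphen is clearly literal.
REFERER_RE = re.compile(
    r"https?://(?:[\w-]*\.)*?"
    r"(?P<coming_from>("
    r"(?P<ip>[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|"
    r"(?P<nodots>[a-zA-Z0-9_-]+$)|"
    r"(?P<the_domain>([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))"
)

def classify(referer):
    """Return the four named captures for one referer URL, or None if nothing matches."""
    m = REFERER_RE.search(referer)
    if m is None:
        return None
    return {name: m.group(name) for name in ("ip", "nodots", "the_domain", "coming_from")}

for url in ("http://www.blah.com", "https://1.2.3.4", "http://my_srv",
            "https://a.b.co.uk", "http://as.df.jk.edu"):
    print(url, classify(url))
```

Because every alternative ends in `$`, the pattern only matches when the field holds nothing after the host (no port or path).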

Hope this helps,

Kristian

Re: Regex - fqdn, basedomain etc. from URL in CK log

kristian_kolb — Tue, 28 Aug 2012 09:11:34 GMT

updated a typo

Re: Regex - fqdn, basedomain etc. from URL in CK log

aaronnicoli — Tue, 28 Aug 2012 14:07:33 GMT

Kristian, thanks for the in depth answer, very much appreciate it.

I have it all running in the search app using rex like you explained. However, my issue is making an "all-in-one" regex that finds the right field in the csv and then runs the "domain" regex on it...

I think this is what I am after:

://(?:[\w-]*\.)*?(?<domainname>(?<ip>^[A-Fa-f\d\.:]+$)|(?<nodots>^[^\.]+$)|(?<fqdomain>(?:(?:[^\.]+\.)?(?<tld>((?:[^\.\s]{3})|(?:[^\.\s]{2}))(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+)))))$)

I'm away from office though at the moment, I will let you know when I'm back.

Re: Regex - fqdn, basedomain etc. from URL in CK log

aaronnicoli — Wed, 29 Aug 2012 01:28:37 GMT

Okay so...

Didn't have much luck with the previous response sorry...

I have been fiddling and seeing what I can come up with, this is what I now have:

([^,]+, ){7}[^/]+://(?<basedomain>(\[(?<ip6>[^\]]+)\][:/, ])|((?<ip4>\d+(\.\d+){3})[:/, ])|((?<nodots>[^\.,/: ]+)[:,/ ])|(?<fqdomain>(?:(?:[^\.]+\.)?(?<tld>((?:[^\.\s]{3})|(?:[^\.\s]{2}))(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+))))))

I understand basedomain is broken, as are fqdomain and tld, which for the life of me I can't get to work. (ip4, ip6 and nodots work nicely, though.)

From the sample data in my first post, this is what I expect to see when the following search is run:

base-search |table basedomain ip4 fqdomain tld

basedomain          ip4       fqdomain                   tld
ebsco-content.com   <blank>   global.ebsco-content.com   .com

I need some regex experts to help me on this...

Thanks in advance, Aaron.
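
One possible direction for the combined extraction, sketched in Python so it can be run outside Splunk. Everything here is an assumption rather than a tested Splunk answer: `HOST_RE`, `extract` and the field layout (seven comma-separated fields before the URL, as in the sample event) are illustrative, the `.co`/`.com` second-level handling is borrowed from Kristian's pattern, and Python needs `(?P<name>...)` for named groups. The idea is to nest `basedomain` and `tld` inside `fqdomain`, and to use a lookahead so the host must end at a delimiter instead of anchoring with `$`.

```python
import re

HOST_RE = re.compile(
    r"^(?:[^,]+, ){7}[^/]+://"                 # skip seven CSV fields and the scheme
    r"(?:"
    r"\[(?P<ip6>[^\]]+)\]"                     # bracketed IPv6 literal
    r"|(?P<ip4>\d{1,3}(?:\.\d{1,3}){3})"       # dotted-quad IPv4
    r"|(?P<fqdomain>(?:[\w-]+\.)*?"            # optional subdomains (lazy)
    r"(?P<basedomain>[\w-]+"                   # registered name ...
    r"(?P<tld>\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})))"  # ... plus .com.au-style or plain suffix
    r"|(?P<nodots>[\w-]+)"                     # bare hostname without dots
    r")"
    r"(?=[:/, ]|$)"                            # host must be followed by a delimiter or end of line
)

def extract(line):
    """Return all named captures for one log line, or None if the line does not match."""
    m = HOST_RE.search(line)
    return m.groupdict() if m else None

sample = ("Aug 28 13:05:26 111.111.1.1 28-08-2012; 13:04:48, 26, 111.111.111.11, "
          "username@hostname, 125679, 1, text/html, "
          "http://global.ebsco-content.com/interfacefiles/12.4.33.0.2/javascript/bundled/_layout2/master.js, "
          "default, Educational")

fields = extract(sample)
print(fields["basedomain"], fields["ip4"], fields["fqdomain"], fields["tld"])
# ebsco-content.com None global.ebsco-content.com .com
```

A hand-written suffix rule like this only covers `.co.xx` and `.com.xx` style second-level domains; other registry suffixes would need extra alternatives in the `tld` group.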