Splunk Search

Regex - fqdn, basedomain etc. from URL in CK log

aaronnicoli
Path Finder

Hi there,

I have taken the following regex from here...

http://splunk-base.splunk.com/answers/9736/revisiting-regex-to-extract-domain-name-from-an-fqdn/1040...

And modified it to suit domains such as .com.au, leaving it like:

(?<domainname>(?<ip>^[A-Fa-f\d\.:]+$)|(?<nodots>^[^\.]+$)|(?<fqdomain>(?:(?:[^\.]+\.)?(?<tld>((?:[^\.\s]{3})|(?:[^\.\s]{2}))(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+)))))$)

Now, I have data formatted in csv style containing a url string...
To extract the domain/ip string from the data, I use this regex:

(?i)^(?:[^ ]* ){12}.+://(?P<domain>[^:|,|/]+)[/,]?

What I wish to do is create a single regex that produces the domainname, nodots, fqdomain and tld fields from the value captured by the second regex (the domain field).
Can someone please help me combine the two extractions into a single one?
I'm not the best when it comes to Splunk regex...

Here is some sample data:

Aug 28 13:05:26 111.111.1.1 28-08-2012; 13:04:48, 26, 111.111.111.11, username@hostname, 125679, 1, text/html, http://global.ebsco-content.com/interfacefiles/12.4.33.0.2/javascript/bundled/_layout2/master.js, default, Educational
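
As a quick check outside Splunk, the positional regex above can be run against this sample line in Python. This is only a sketch: Python's re module stands in for Splunk's PCRE engine, and the names sample and DOMAIN_RE are made up for illustration.

```python
import re

# Sample event copied from the post above.
sample = ("Aug 28 13:05:26 111.111.1.1 28-08-2012; 13:04:48, 26, 111.111.111.11, "
          "username@hostname, 125679, 1, text/html, "
          "http://global.ebsco-content.com/interfacefiles/12.4.33.0.2/javascript/"
          "bundled/_layout2/master.js, default, Educational")

# Skip the first 12 space-delimited tokens, then capture the host after "://".
DOMAIN_RE = re.compile(r"(?i)^(?:[^ ]* ){12}.+://(?P<domain>[^:|,|/]+)[/,]?")

match = DOMAIN_RE.search(sample)
domain = match.group("domain") if match else None
print(domain)
```

The extraction depends on the URL always being the 13th space-delimited token, so an extra space in an earlier column (for example in the username) would shift it.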

aaronnicoli
Path Finder

Okay so...

Didn't have much luck with the previous response, sorry...

I have been fiddling to see what I can come up with, and this is what I now have:

([^,]+, ){7}[^/]+://(?<basedomain>(\[(?<ip6>[^\]]+)\][:/, ])|((?<ip4>\d+(\.\d+){3})[:/, ])|((?<nodots>[^\.,/: ]+)[:,/ ])|(?<fqdomain>(?:(?:[^\.]+\.)?(?<tld>((?:[^\.\s]{3})|(?:[^\.\s]{2}))(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+))))))

I understand basedomain is broken, as are fqdomain and tld, which for the life of me I can't get to work. (ip4 and ip6, as well as nodots, work nicely though.)

From the sample data in my first post, this is what I expect to see when the following search is run:

base-search |table basedomain ip4 fqdomain tld

ebsco-content.com <blank> global.ebsco-content.com .com

I need some regex experts to help me on this...

Thanks in advance, Aaron.
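
For reference, the expected row above can be reproduced in two steps (extract the host first, then split it) rather than with one combined regex. The Python sketch below is an illustration under assumptions, not a Splunk extraction: split_host and TWO_LABEL_SUFFIXES are hypothetical names, and the suffix list is deliberately tiny. Real traffic would need a fuller list such as the Public Suffix List.

```python
import re

# Hypothetical, deliberately tiny list of two-label suffixes (assumption).
TWO_LABEL_SUFFIXES = {"com.au", "net.au", "org.au", "co.uk"}

IP4_RE = re.compile(r"^\d+(?:\.\d+){3}$")

def split_host(host):
    """Return basedomain, ip4, fqdomain and tld for a bare host string."""
    fields = {"basedomain": None, "ip4": None, "fqdomain": None, "tld": None}
    if IP4_RE.match(host):
        # Mirrors the posted regex, where basedomain wraps the ip4 branch.
        fields["ip4"] = fields["basedomain"] = host
        return fields
    labels = host.lower().split(".")
    if len(labels) == 1:
        fields["basedomain"] = host  # plain hostname, no dots
        return fields
    # Use a two-label suffix when the last two labels are in the list.
    suffix_len = 2 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 1
    fields["fqdomain"] = ".".join(labels)
    fields["tld"] = "." + ".".join(labels[-suffix_len:])
    fields["basedomain"] = ".".join(labels[-(suffix_len + 1):])
    return fields

print(split_host("global.ebsco-content.com"))
print(split_host("www.example.com.au"))
```

The first call yields basedomain ebsco-content.com, fqdomain global.ebsco-content.com and tld .com, matching the row above; the second shows the .com.au case from the original question.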


kristian_kolb
Ultra Champion

The following may not be the exact solution you need, but it may help you along a little bit. It is using the referer_domain field of access_combined logs.

First is the full search that I used, then a slightly more readable version which may not run due to indentation, then a results table.

sourcetype="access_combined" | head 10000 | rex field=referer_domain "https?://(?:[\w-]*\.)*?(?<coming_from>((?<ip>[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|(?<nodots>[a-zA-Z0-9-_]+$)|(?<the_domain>([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))" | table referer_domain ip nodots the_domain coming_from | dedup referer_domain

Base search

sourcetype="access_combined" | head 10000 | 

Start rexing, skip the protocol plus optional hostnames/subdomains

rex field=referer_domain "https?://(?:[\w-]*\.)*?

The field coming_from will get the final result

(?<coming_from>(

find the ip address if any (only ipv4 - add ipv6 if needed)

(?<ip>[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|

or a plain hostname

(?<nodots>[a-zA-Z0-9-_]+$)|

or find the domain (with an optional .co or .com as the second to last part) allowing for up to 5 characters in the top-level (change as needed)

(?<the_domain>([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))" | 

output the result for testing

table referer_domain ip nodots the_domain coming_from | dedup referer_domain

Expected result:

referer_domain       ip      nodots   the_domain    coming_from
http://www.blah.com                   blah.com      blah.com
https://1.2.3.4      1.2.3.4                        1.2.3.4
http://my_srv                my_srv                 my_srv
https://a.b.co.uk                     b.co.uk       b.co.uk
http://as.df.jk.edu                   jk.edu        jk.edu

I actually haven't tried the nodots function since I didn't have log data to test it on.
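
For anyone who wants to try the pattern without access_combined data, the table above can be checked in Python. This is a sketch, assuming Python's re module behaves like Splunk's PCRE for this pattern: the named groups are rewritten from (?<name>...) to (?P<name>...), the hyphen in the nodots character class is moved to the end, and the extract helper is a made-up name.

```python
import re

# The rex pattern above with (?<name>...) rewritten as (?P<name>...) for Python.
PATTERN = re.compile(
    r"https?://(?:[\w-]*\.)*?"
    r"(?P<coming_from>("
    r"(?P<ip>[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|"
    r"(?P<nodots>[a-zA-Z0-9_-]+$)|"
    r"(?P<the_domain>([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))"
)

def extract(url):
    """Return the four named groups for one referer_domain value, or None."""
    m = PATTERN.search(url)
    if m is None:
        return None
    return {k: m.group(k) for k in ("ip", "nodots", "the_domain", "coming_from")}

# The five referer_domain values from the expected-result table.
for url in ("http://www.blah.com", "https://1.2.3.4", "http://my_srv",
            "https://a.b.co.uk", "http://as.df.jk.edu"):
    print(url, extract(url))
```

Because every branch ends in $, the pattern only matches when the host is the last thing in the field, so a full URL with a path would need the host isolated first.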

Hope this helps,

Kristian

aaronnicoli
Path Finder

Kristian, thanks for the in-depth answer, I very much appreciate it.

I have it all running in the search app using rex like you explained. However, my issue is making an "all-in-one" regex that finds the right field in the csv, then runs the "domain" regex on it...

I think this is what I am after:

://(?:[\w-]*\.)*?(?<domainname>(?<ip>^[A-Fa-f\d\.:]+$)|(?<nodots>^[^\.]+$)|(?<fqdomain>(?:(?:[^\.]+\.)?(?<tld>((?:[^\.\s]{3})|(?:[^\.\s]{2}))(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+)))))$)

I'm away from the office at the moment though, so I will let you know when I'm back.
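
One thing that may be worth checking in the combined pattern is the ^ anchors that now sit after "://". In PCRE, as in Python, ^ only matches at the start of the subject (or after a newline in multiline mode), so a branch that starts with ^ cannot match once "://" has already been consumed. The sketch below shows this in Python with a simplified pattern that keeps only the nodots branch; the names anchored and unanchored are illustrative.

```python
import re

# Simplified to the nodots branch only, using Python's (?P<name>...) syntax.
anchored = re.compile(r"://(?P<nodots>^[^.]+$)")
unanchored = re.compile(r"://(?P<nodots>[^.]+$)")

# With ^ after "://", the branch can never match.
anchored_result = anchored.search("http://my_srv")
print(anchored_result)

# Dropping the ^ lets the same branch capture the host.
unanchored_result = unanchored.search("http://my_srv")
print(unanchored_result.group("nodots"))
```

The same reasoning would apply to the ip branch, which also begins with ^.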


kristian_kolb
Ultra Champion

updated a typo


aaronnicoli
Path Finder

Please help guys, I am really at a loss here 😞
