Re: Regex - fqdn, basedomain etc. from URL in CK l...

aaronnicoli · ‎08-27-2012

Hi there,

I have taken the following regex from here...

http://splunk-base.splunk.com/answers/9736/revisiting-regex-to-extract-domain-name-from-an-fqdn/1040...

And modified it to suit domains such as .com.au, leaving it like:

(?<domainname>(?<ip>^[A-Fa-f\d\.:]+$)|(?<nodots>^[^\.]+$)|(?<fqdomain>(?:(?:[^\.]+\.)?(?<tld>((?:[^\.\s]{3})|(?:[^\.\s]{2}))(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+)))))$)

Now, I have data formatted in csv style containing a url string...
To extract the domain/ip string from the data, I use this regex:

(?i)^(?:[^ ]* ){12}.+://(?P<domain>[^:|,|/]+)[/,]?

What I wish to do is create a single regex that will create the domainname, nodots, fqdomain and tld fields from the domain captured by the second extraction.
Can someone please help me combine the two extractions into a single one?
I'm not the best when it comes to Splunk regex...

Here is some sample data:

Aug 28 13:05:26 111.111.1.1 28-08-2012; 13:04:48, 26, 111.111.111.11, username@hostname, 125679, 1, text/html, http://global.ebsco-content.com/interfacefiles/12.4.33.0.2/javascript/bundled/_layout2/master.js, default, Educational
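
For anyone who wants to check the second extraction outside Splunk, here is a minimal sketch in Python rather than SPL. It runs that regex unchanged against the sample line above (Python's re module accepts the same (?P<domain>...) syntax). The variable names are only for illustration and are not part of the original post.

```python
import re

# The sample event from the post, copied verbatim.
sample = (
    "Aug 28 13:05:26 111.111.1.1 28-08-2012; 13:04:48, 26, 111.111.111.11, "
    "username@hostname, 125679, 1, text/html, "
    "http://global.ebsco-content.com/interfacefiles/12.4.33.0.2/javascript/"
    "bundled/_layout2/master.js, default, Educational"
)

# The extraction regex from the post: skip 12 space-delimited tokens, then
# capture everything after "://" up to the first ':', '|', ',' or '/'.
domain_re = re.compile(r"(?i)^(?:[^ ]* ){12}.+://(?P<domain>[^:|,|/]+)[/,]?")

match = domain_re.search(sample)
domain = match.group("domain") if match else None
print(domain)  # global.ebsco-content.com
```

Note that the {12} count depends on the exact number of space-separated tokens before the URL, so a log line with a different layout would need a different count.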

aaronnicoli · ‎08-28-2012

Okay so...

Didn't have much luck with the previous response sorry...

I have been fiddling to see what I can come up with, and this is what I now have:

([^,]+, ){7}[^/]+://(?<basedomain>(\[(?<ip6>[^\]]+)\][:/, ])|((?<ip4>\d+(\.\d+){3})[:/, ])|((?<nodots>[^\.,/: ]+)[:,/ ])|(?<fqdomain>(?:(?:[^\.]+\.)?(?<tld>((?:[^\.\s]{3})|(?:[^\.\s]{2}))(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+))))))

I understand basedomain is broken, but so are fqdomain and tld, which for the life of me I can't get to work. (ip4 and ip6, as well as nodots, work nicely though.)

From the sample data in my first post, this is what I expect to see when the following search is run:

base-search |table basedomain ip4 fqdomain tld

basedomain         ip4      fqdomain                  tld
ebsco-content.com  <blank>  global.ebsco-content.com  .com

I need some regex experts to help me on this...

Thanks in advance, Aaron.
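
To make the expected output concrete, here is a rough sketch of the field logic in Python. It is not a single Splunk extraction, just the same idea split into two steps: pull the host out of the URL, then classify it. The function name split_host, the helper patterns, and especially the short list of two-part suffixes (.com.au, .co.uk and similar) are my own assumptions for illustration; a real suffix list would be much longer.

```python
import re

# Hypothetical helpers; only the field names come from the post.
URL_HOST = re.compile(r"://(?P<host>\[[^\]]+\]|[^:/, ]+)")
IP4 = re.compile(r"^\d+(?:\.\d+){3}$")
# Assumption: the base domain is one label plus either a two-part suffix
# (e.g. .com.au, .co.uk) or a single-label suffix (e.g. .com).
BASE = re.compile(
    r"(?P<basedomain>[^.]+"
    r"(?P<tld>\.(?:com?|net|org|edu|gov)\.[a-z]{2}|\.[a-z]{2,}))$",
    re.IGNORECASE,
)

def split_host(line):
    """Return the basedomain/ip4/ip6/nodots/fqdomain/tld fields for one event."""
    fields = dict.fromkeys(
        ("basedomain", "ip4", "ip6", "nodots", "fqdomain", "tld")
    )
    m = URL_HOST.search(line)
    if m is None:
        return fields
    host = m.group("host")
    if host.startswith("["):          # bracketed IPv6 literal
        fields["ip6"] = host[1:-1]
    elif IP4.match(host):             # dotted-quad IPv4
        fields["ip4"] = host
    elif "." not in host:             # plain hostname
        fields["nodots"] = host
    else:                             # fully qualified name
        fields["fqdomain"] = host
        b = BASE.search(host)
        if b:
            fields["basedomain"] = b.group("basedomain")
            fields["tld"] = b.group("tld")
    return fields

sample = (
    "Aug 28 13:05:26 111.111.1.1 28-08-2012; 13:04:48, 26, 111.111.111.11, "
    "username@hostname, 125679, 1, text/html, "
    "http://global.ebsco-content.com/interfacefiles/12.4.33.0.2/javascript/"
    "bundled/_layout2/master.js, default, Educational"
)
result = split_host(sample)
print(result["basedomain"], result["ip4"], result["fqdomain"], result["tld"])
```

For the sample event this gives basedomain ebsco-content.com, an empty ip4, fqdomain global.ebsco-content.com and tld .com, which is the row I described above.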

kristian_kolb · ‎08-28-2012

The following may not be the exact solution you need, but it may help you along a little bit. It is using the referer_domain field of access_combined logs.

First is the full search that I used, then a slightly more readable version (which may not run as-is because of the line breaks), then a results table.

sourcetype="access_combined" | head 10000 | rex field=referer_domain "https?://(?:[\w-]*\.)*?(?<coming_from>((?<ip>[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|(?<nodots>[a-zA-Z0-9-_]+$)|(?<the_domain>([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))" | table referer_domain ip nodots the_domain coming_from | dedup referer_domain

Base search

sourcetype="access_combined" | head 10000 |

Start rexing, skip the protocol plus optional hostnames/subdomains

rex field=referer_domain "https?://(?:[\w-]*\.)*?

The field coming_from will get the final result

(?<coming_from>(

find the ip address if any (only ipv4 - add ipv6 if needed)

(?<ip>[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|

or a plain hostname

(?<nodots>[a-zA-Z0-9-_]+$)|

or find the domain (with an optional .co or .com as the second to last part) allowing for up to 5 characters in the top-level (change as needed)

(?<the_domain>([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))" |

output the result for testing

table referer_domain ip nodots the_domain coming_from | dedup referer_domain

Expected result:

referer_domain       ip      nodots   the_domain    coming_from
http://www.blah.com                   blah.com      blah.com
https://1.2.3.4      1.2.3.4                        1.2.3.4
http://my_srv                my_srv                 my_srv
https://a.b.co.uk                     b.co.uk       b.co.uk
http://as.df.jk.edu                   jk.edu        jk.edu

I actually haven't tried the nodots function since I didn't have log data to test it on.

Hope this helps,

Kristian
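
In case it is useful for testing without log data, the pattern can also be tried in Python against the URLs from the expected-result table. This is only a sketch: Python's re module is not Splunk's regex engine, so I changed (?<name> to (?P<name> and moved the hyphen in the nodots character class to the end. The extract function name is made up for the example.

```python
import re

# Port of the rex pattern above. Adaptations for Python (not part of the
# original search): (?<name> becomes (?P<name>, and the hyphen in the nodots
# class is moved to the end so it is clearly a literal.
pattern = re.compile(
    r"https?://(?:[\w-]*\.)*?"
    r"(?P<coming_from>("
    r"(?P<ip>[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|"
    r"(?P<nodots>[a-zA-Z0-9_-]+$)|"
    r"(?P<the_domain>([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))"
)

def extract(url):
    """Return the four named fields for one referer_domain value, or None."""
    m = pattern.search(url)
    if m is None:
        return None
    return {k: m.group(k) for k in ("ip", "nodots", "the_domain", "coming_from")}

# The URLs from the expected-result table.
urls = [
    "http://www.blah.com",
    "https://1.2.3.4",
    "http://my_srv",
    "https://a.b.co.uk",
    "http://as.df.jk.edu",
]
for url in urls:
    print(url, extract(url))
```

Under Python's engine each URL fills exactly one of ip, nodots or the_domain, with coming_from holding the same value, including the nodots case for my_srv.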

aaronnicoli · ‎08-28-2012

Kristian, thanks for the in-depth answer, I very much appreciate it.

I have it all running in the search app using rex like you explained. However, my issue is making an "all-in-one" regex that finds the right field in the csv, then runs the "domain" regex on it...

I think this is what I am after:

://(?:[\w-]*\.)*?(?<domainname>(?<ip>^[A-Fa-f\d\.:]+$)|(?<nodots>^[^\.]+$)|(?<fqdomain>(?:(?:[^\.]+\.)?(?<tld>((?:[^\.\s]{3})|(?:[^\.\s]{2}))(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+)))))$)

I'm away from office though at the moment, I will let you know when I'm back.
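
One thing that may be worth checking in the pattern above: the ip and nodots alternatives still start with ^, but they now sit after ://. Without a multiline flag, ^ only matches at the very start of the string, so those two branches probably cannot match any more. Here is a simplified sketch in Python (only the ip alternative, not the full pattern, and Python's re rather than Splunk's engine) that shows the effect. The variable names are just for the example.

```python
import re

host = "http://1.2.3.4"

# '^' can only match at the start of the string (no MULTILINE flag), so it can
# never succeed after "://" has already consumed characters.
anchored = re.compile(r"://(?P<ip>^[A-Fa-f\d\.:]+$)")
# The same alternative with the leading '^' dropped.
unanchored = re.compile(r"://(?P<ip>[A-Fa-f\d\.:]+$)")

print(anchored.search(host))                 # None
print(unanchored.search(host).group("ip"))   # 1.2.3.4
```

If the same holds in Splunk, dropping the two ^ anchors (and keeping the $ anchors) should be enough to let those branches match again.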

kristian_kolb · ‎08-28-2012

updated a typo

aaronnicoli · ‎08-27-2012

Please help guys, I am really at a loss here 😞
