
Revisiting REGEX to extract domain name from an FQDN

Communicator

I've read and used the regexes from this URL: http://answers.splunk.com/questions/8028/extracting-domain-name-out-of-a-url, but I still run into problems with my extractions. In many cases the logs contain domains that end in .co.cc, .co.uk and .co.au, to name a few. The regex examples in the link above only extract the tail end, for example .co.cc. I need to figure out how to also grab the name before it, for example guardian.co.uk.

Any ideas on how to get this REGEX to work?


Re: Revisiting REGEX to extract domain name from an FQDN

Splunk Employee

What exactly is the regex you are using? Is it the same as in question 8028? That regex, [^\.\s]+\.[^\.\s]+$, is built to grab the last two portions of a hostname. You need to make sure that your regex grabs not just the last two portions but the last three.

Now, you also need to be careful, because you might have non-UK hosts, in which case the domain is only the last two portions, such as whatever.com.

Perhaps then a more complex regex that grabs the following might work for you:

([^\.\s]+\.co\.[^\.\s]+$)|([^\.\s]+\.[^\.\s]+$)

In my quick test for the following:

boo.com
foo.bar
gentoo.what.bar
what.are.you.talking.about
gentoo.co.uk

the above regex captures:

boo.com
foo.bar
what.bar
talking.about
gentoo.co.uk
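
If you want to repeat that quick test outside Splunk, here is a minimal Python sketch. The function name extract_domain is just something made up for the example, and Python's re module stands in for Splunk's PCRE engine, which should behave the same way for a pattern this simple.

```python
import re

# Same pattern as above: try "<name>.co.<tld>" first, else the last two labels.
DOMAIN_RE = re.compile(r"([^\.\s]+\.co\.[^\.\s]+$)|([^\.\s]+\.[^\.\s]+$)")

def extract_domain(hostname):
    """Return the text matched by either alternative, or None if nothing matches."""
    match = DOMAIN_RE.search(hostname)
    return match.group(0) if match else None

hosts = [
    "boo.com",
    "foo.bar",
    "gentoo.what.bar",
    "what.are.you.talking.about",
    "gentoo.co.uk",
]

for host in hosts:
    print(host, "->", extract_domain(host))
```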

This regex, however, only takes into consideration domains that have .co. in the middle. If other middle portions are in use (for example .com.au or .org.uk), you might want to change the regex to accept those as well.

Hope this helps.


Re: Revisiting REGEX to extract domain name from an FQDN

Communicator

I copied/pasted this REGEX into a field extract. It does grab the different .co.xx domains, but now Splunk doesn't show the other "normal" domains, like google.com, etc. I've tested the REGEX in RegexBuddy and it does extract correctly.

Any idea what may be missing?
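
One thing that may be worth checking (a guess, not something confirmed for this setup): the regex has two separate capture groups, so a .co.xx domain lands in group 1 while google.com lands in group 2. A tool that highlights the whole match hides this, but an extraction that only reads the first group would come up empty for the ordinary domains. The Python sketch below shows the effect, plus a variant with a single named group wrapped around both alternatives. The group name domain is purely illustrative, and Python spells named groups (?P<name>...) where PCRE also accepts (?<name>...).

```python
import re

# Original form: each alternative has its own capture group.
TWO_GROUPS = re.compile(r"([^\.\s]+\.co\.[^\.\s]+$)|([^\.\s]+\.[^\.\s]+$)")

# Variant: one named group around the whole alternation.
ONE_GROUP = re.compile(r"(?P<domain>[^\.\s]+\.co\.[^\.\s]+$|[^\.\s]+\.[^\.\s]+$)")

for host in ("www.guardian.co.uk", "www.google.com"):
    two = TWO_GROUPS.search(host)
    one = ONE_GROUP.search(host)
    # groups() shows which of the two groups actually captured something
    print(host, "| groups:", two.groups(), "| named:", one.group("domain"))
```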


Re: Revisiting REGEX to extract domain name from an FQDN

Motivator

At some level, it becomes increasingly important to define the rules you want to follow.

The precision of the definition and the nature of the constraints dictate how complex the regex gets. More complex expressions may also perform worse: the regex engine may end up scanning the same text multiple times, looking for different ways to match.

For example, here's one way to look at defining the domain name that would fit most purposes:

  1. A domain name is a single (optional) segment of the hostname, plus a TLD
  2. If the last two segments are each two characters, the TLD is two segments (.co.uk)
  3. If the TLD is not in .co.uk form, then it is one segment
  4. If the hostname is an IP address, then capture the entire string
  5. Hostnames of only 1-2 characters are considered valid
  6. Hostnames beginning with a dot are not considered valid.

A simplistic approach is to break that down into three possibilities, and check each one:

  1. An IP address
  2. A string with no dots
  3. A fully qualified domain name
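
To make the rules concrete, the three-way check can also be written as ordinary code rather than one expression. This is only a rough Python sketch of the rules as listed: the function name domain_of is made up, and the "looks like an IP" test is simply a string of hex digits, dots and colons, which is a loose approximation and will also swallow hostnames made up entirely of those characters.

```python
import re

# Loose "IP-like" test (assumption): only hex digits, dots and colons.
IP_LIKE = re.compile(r"^[A-Fa-f\d.:]+$")

def domain_of(hostname):
    """Apply the rules above; return None for names that are not considered valid."""
    if not hostname or hostname.startswith("."):
        return None                      # rule 6: leading dot is invalid
    if IP_LIKE.match(hostname):
        return hostname                  # rule 4: keep IP addresses whole
    labels = hostname.split(".")
    if len(labels) == 1:
        return hostname                  # no dots, including 1-2 character names
    # rules 2-3: a two-segment TLD only when both trailing labels are 2 characters
    two_part_tld = len(labels[-1]) == 2 and len(labels[-2]) == 2
    tld_len = 2 if two_part_tld else 1
    # rule 1: one (optional) segment plus the TLD
    return ".".join(labels[-(tld_len + 1):])

for host in ("gentoo.co.uk", "xx.boo.com", "127.0.0.1", "hostname", ".bad.com"):
    print(host, "->", domain_of(host))
```

Splitting on dots avoids backtracking entirely, at the cost of no longer being something you can paste into a field extraction.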


Given those three possibilities, here's one (ugly and inefficient) regex that would work:

(?<domainname>(?<ip>^[A-Fa-f\d\.:]+$)|(?<nodots>^[^\.]+$)|(?<fqdomain>(?:(?:[^\.]+\.)?(?<tld>(?:[^\.\s]{2})(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+)))))$)


That should produce the following results:

u                          ->    u
uk                         ->    uk
hostname                   ->    hostname
0.track.ning.com           ->    ning.com
0.tqn.com                  ->    tqn.com
0.r.msn.com                ->    msn.com
0.52.channel.facebook.com  ->    facebook.com
fe80::48e:b5c4:5670        ->    fe80::48e:b5c4:5670
fe80::f002:192.168.1.1     ->    fe80::f002:192.168.1.1
127.0.0.1                  ->    127.0.0.1
gentoo.co.uk               ->    gentoo.co.uk
what.are.you.talking.about ->    talking.about
gentoo.what.bar            ->    what.bar
foo.ba                     ->    oo.ba
boo.com                    ->    boo.com
xx.boo.com                 ->    boo.com
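
For anyone who wants to re-run that table, here is a small Python harness. The only change to the pattern is that Python needs (?P<name>...) for named groups instead of (?<name>...); everything else is copied as is. Note that foo.ba comes out as oo.ba because the TLD part of the pattern needs at least three characters unless it is in xx.yy form, so the engine slides forward until "oo" plus ".ba" fits.

```python
import re

# The regex from above, with (?<name> rewritten as (?P<name> for Python.
DOMAIN_RE = re.compile(
    r"(?P<domainname>(?P<ip>^[A-Fa-f\d\.:]+$)|(?P<nodots>^[^\.]+$)|"
    r"(?P<fqdomain>(?:(?:[^\.]+\.)?(?P<tld>(?:[^\.\s]{2})"
    r"(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+)))))$)"
)

def extract(hostname):
    """Return the domainname group, or None when the pattern does not match."""
    match = DOMAIN_RE.search(hostname)
    return match.group("domainname") if match else None

hosts = [
    "u", "uk", "hostname", "0.track.ning.com", "0.tqn.com", "0.r.msn.com",
    "0.52.channel.facebook.com", "fe80::48e:b5c4:5670", "fe80::f002:192.168.1.1",
    "127.0.0.1", "gentoo.co.uk", "what.are.you.talking.about", "gentoo.what.bar",
    "foo.ba", "boo.com", "xx.boo.com",
]

for host in hosts:
    print(f"{host:<26} ->    {extract(host)}")
```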

Re: Revisiting REGEX to extract domain name from an FQDN

New Member

This regexp seems to be doing the job, but I can't add it to the field extractor: it does not match there, even though it matches for me in other regexp tools:

([a-z0-9_\-]{1,5})?(:\/\/)?(([a-z0-9_\-]{1,})(:([a-z0-9_\-]{1,}))?\@)?((www\.)|([a-z0-9_\-]{1,}\.)+)?([a-z0-9_\-]{3,})\.([a-z]{2,4})(\/([a-z0-9_\-]{1,}\/)+)?([a-z0-9_\-]{1,})?(\.[a-z]{2,})?(\?)?(((\&)?[a-z0-9_\-]{1,}(\=[a-z0-9_\-]{1,})?)+)?

Removing the first part would also drop the "http://" prefix:

([a-z0-9_\-]{1,5})?(:\/\/)?(([a-z0-9_\-]{1,})(:([a-z0-9_\-]{1,}))?\@)?

It would be great for Splunk to include an autodetection tool for this. In my case, I'm interested in being able to count all traffic to, say, alsur.es (www.alsur.es, img1.alsur.es, cdn.alsur.es:77...) under a single entry, "alsur.es", or even just "alsur".
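
Outside Splunk, the kind of roll-up I have in mind could be sketched like this in Python. The helper name registered_domain is made up, the http:// entry in the sample list is invented for the demo, and the sketch assumes the domain is always the last two labels, so it would get .co.uk style names wrong.

```python
from collections import Counter
from urllib.parse import urlsplit

def registered_domain(raw):
    """Last two labels of the host in `raw`, ignoring scheme, port and path."""
    if "://" not in raw:
        raw = "//" + raw  # lets urlsplit treat a bare "host:port" as a network location
    host = urlsplit(raw).hostname or ""
    return ".".join(host.split(".")[-2:])

# Sample values modeled on the hosts mentioned above.
seen = ["www.alsur.es", "img1.alsur.es", "cdn.alsur.es:77", "http://www.alsur.es/index.html"]

counts = Counter(registered_domain(item) for item in seen)
short_names = {domain: domain.split(".")[0] for domain in counts}

print(counts)
print(short_names)
```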
