I've read and used the regexes from this thread: http://answers.splunk.com/questions/8028/extracting-domain-name-out-of-a-url, but I still run into issues with my extractions. In many cases the logs contain domains that end in .co.cc, .co.uk and .co.au, to name a few. The regex examples in the link above extract only the tail end, for example .co.cc. I need to figure out how to also grab the name before it, for example guardian.co.uk.
Any ideas on how to get this REGEX to work?
What exactly is the regex you are using? Is it the same as in question 8028? That regex, [^\.\s]+\.[^\.\s]+$, is built so that it grabs the last two portions of a hostname. You need to make sure that your regex doesn't just grab the last two portions but the last three.
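To see why that pattern returns only the tail of a .co.uk host, here is a minimal Python sketch that runs the same expression against two hostnames. The function name and the sample hosts are made up for illustration; Splunk itself uses PCRE, but this particular pattern behaves the same way in Python's re module.

```python
import re

# Same pattern as in question 8028: two dot-separated labels anchored at the end.
# Inside a character class the dot needs no escaping, so [^.\s] equals [^\.\s].
LAST_TWO = re.compile(r"[^.\s]+\.[^.\s]+$")

def last_two_labels(host):
    """Return the last two labels of host, or None if there is no dot."""
    match = LAST_TWO.search(host)
    return match.group(0) if match else None

print(last_two_labels("www.google.com"))      # google.com
print(last_two_labels("www.guardian.co.uk"))  # co.uk
```

The second result is the problem described in the question: the name in front of .co.uk is dropped.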
Now, you also need to be careful, because you might have non-UK hosts, in which case the domain is only the last two portions, such as
Perhaps then a more complex regex that grabs the following might work for you:
In my quick test for the following:
boo.com foo.bar gentoo.what.bar what.are.you.talking.about gentoo.co.uk
the above regex captures:
boo.com foo.bar what.bar talking.about gentoo.co.uk
This regex, however, takes into consideration only those domains that contain .co. in the middle. If you have other domains in use, you might want to change your regex to accept whatever other middle portion the domain might have.
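As an illustration of that idea, the sketch below builds a pattern with an optional middle portion taken from a configurable list. This is not necessarily the regex from this answer; the pattern, the function names and the list of middle labels are assumptions made for the example.

```python
import re

def build_domain_regex(middle_labels=("co",)):
    """Build a pattern matching the last two labels, or the last three
    when the second-to-last label is one of middle_labels (assumed list)."""
    middle = "|".join(re.escape(label) for label in middle_labels)
    return re.compile(rf"[^.\s]+\.(?:(?:{middle})\.)?[^.\s]+$")

def extract_domain(host, pattern=None):
    """Return the matched domain portion of host, or None if nothing matches."""
    pattern = pattern or build_domain_regex()
    match = pattern.search(host)
    return match.group(0) if match else None

hosts = ["boo.com", "foo.bar", "gentoo.what.bar",
         "what.are.you.talking.about", "gentoo.co.uk"]
for host in hosts:
    print(host, "->", extract_domain(host))

# Accepting other middle portions only requires a longer list:
wider = build_domain_regex(("co", "com", "org"))
print(extract_domain("shop.example.com.au", wider))
```

With the default list the five test hosts give boo.com, foo.bar, what.bar, talking.about and gentoo.co.uk, and the wider list turns shop.example.com.au into example.com.au.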
Hope this helps.
I copied and pasted this regex into a field extraction. It does grab the different .co.xx domains, but now Splunk doesn't show the other "normal" domains, like google.com, etc. I've tested the regex in RegexBuddy and it does extract correctly there.
Any idea what may be missing?
At some level, it becomes increasingly important to define the rules you want to follow.
How precise the definition is, and the nature of the constraints, will dictate how complex the regex gets. More complex expressions may also suffer in terms of performance. In particular, the regex engine may have to scan the same text multiple times, looking for different ways to match.
For example, here's one way to look at defining the domain name that would fit most purposes:
.co.uk form, then it is one segment
A simplistic approach is to break that down into three possibilities, and check each one:
Given those three possibilities, here's one (ugly and inefficient) regex that would work:
That should produce the following results:
u -> u
uk -> uk
hostname -> hostname
0.track.ning.com -> ning.com
0.tqn.com -> tqn.com
0.r.msn.com -> msn.com
0.52.channel.facebook.com -> facebook.com
fe80::48e:b5c4:5670 -> fe80::48e:b5c4:5670
fe80::f002:192.168.1.1 -> fe80::f002:192.168.1.1
127.0.0.1 -> 127.0.0.1
gentoo.co.uk -> gentoo.co.uk
what.are.you.talking.about -> talking.about
gentoo.what.bar -> what.bar
foo.ba -> oo.ba
boo.com -> boo.com
xx.boo.com -> boo.com
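For comparison, the same kind of rules can be written without one large regex. The Python sketch below is one reading of the sample results, not the expression from this answer: IP addresses are returned unchanged, a host ending in .co. plus a two-letter code keeps three labels, and everything else keeps at most two. The function name is hypothetical, and the sketch does not attempt to reproduce every row listed above.

```python
import ipaddress

def registered_domain(host):
    """Reduce host to a domain using three simple rules (assumed, see above)."""
    # Rule 1: IPv4 and IPv6 literals are returned as they are.
    try:
        ipaddress.ip_address(host)
        return host
    except ValueError:
        pass

    labels = host.split(".")
    # Rule 2: name.co.xx keeps the last three labels.
    if len(labels) >= 3 and labels[-2] == "co" and len(labels[-1]) == 2:
        return ".".join(labels[-3:])
    # Rule 3: everything else keeps the last two labels (or the only one).
    return ".".join(labels[-2:])

for host in ["hostname", "0.track.ning.com", "127.0.0.1",
             "fe80::48e:b5c4:5670", "gentoo.co.uk", "xx.boo.com"]:
    print(host, "->", registered_domain(host))
```

Splitting on dots avoids the repeated scanning that a regex with several alternatives can cause, at the cost of not being usable directly as a field extraction.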
This regexp seems to be doing the job, but I can't add it to the field extractor: it does not match there, although it does match in other regexp tools:
Removing the first part would also eliminate the "http://"
It would be great for Splunk to include an autodetection tool for this. In my case, the aim is to count all traffic to, say, alsur.es (www.alsur.es, img1.alsur.es, cdn.alsur.es:77...) under a single entry, "alsur.es", or even just "alsur".
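Outside Splunk, that kind of grouping can be sketched in a few lines of Python. The sample values, the function names and the choice of keeping the last two labels are assumptions for illustration; a suffix such as .co.uk would need the three-label handling discussed earlier in the thread.

```python
from collections import Counter
from urllib.parse import urlsplit

def host_of(value):
    """Return the lowercase hostname, dropping any scheme, port and path."""
    if "://" not in value:
        value = "//" + value  # lets urlsplit treat a bare host[:port] as a netloc
    return urlsplit(value).hostname or ""

def site_key(value, labels=2):
    """Keep only the last `labels` labels of the hostname (assumed: 2)."""
    return ".".join(host_of(value).split(".")[-labels:])

def short_name(value):
    """Return just the first label of the site key, e.g. 'alsur'."""
    return site_key(value).split(".")[0]

# Hypothetical log values in the three shapes mentioned above.
samples = ["http://www.alsur.es/index.html", "img1.alsur.es", "cdn.alsur.es:77"]

counts = Counter(site_key(value) for value in samples)
print(counts)                  # Counter({'alsur.es': 3})
print(short_name(samples[2]))  # alsur
```

Stripping the scheme and port before splitting on dots is what keeps "http://" and ":77" out of the result.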