Regex: Issue with domain name extraction from URL field

rturk · ‎06-25-2011

Greetings Splunkers!

I posed this question in the IRC channel, but thought I'd put it in here as well just in case anyone has a similar issue.

I have a "dest_host" field which contains the host part of URLs, e.g.:

www.domain-1.com
tiles.domain-1.com
www.domain-1.com:80
domain-1.com
domain-2.com
ems.domain-2.com
etc...

Essentially, the format is:

<optional-sub-domain>.<domain-name>.<top-level-domain>

I wish to create a regex that extracts only the domain-name value. However, my efforts have been complicated by the values that do not have the sub-domain specified.

Note that I am not overly concerned with values with the port number after them as they are statistically insignificant (but if there's an easy way to include them, that's fine).

The rex I have put together for this is:

rex field=dest_host "(?<domain>.*).com"

But given the above values, this extracts the following:

dest_host, domain
www.domain-1.com, www.domain-1 (incorrect)
tiles.domain-1.com, tiles.domain-1 (incorrect)
www.domain-1.com:80, www.domain-1 (incorrect)
domain-1.com, domain-1 (correct)
domain-2.com, domain-2 (correct)
ems.domain-2.com, ems.domain-2 (incorrect)
etc...
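
The over-capture is easy to reproduce outside Splunk. A minimal sketch, assuming Python's `re` module as a stand-in for rex (Python spells named groups `(?P<name>...)` where rex uses `(?<name>...)`); the `hosts` list is just the sample values above:

```python
import re

# Same pattern as the rex above, in Python named-group syntax.
# The greedy .* first swallows the whole value, then backs off only
# far enough for "any character followed by com" to match, so the
# sub-domain stays inside the capture. Note the unescaped dot before
# "com" matches any character, not just a literal ".".
pattern = re.compile(r"(?P<domain>.*).com")

hosts = [
    "www.domain-1.com",
    "tiles.domain-1.com",
    "www.domain-1.com:80",
    "domain-1.com",
]

results = {h: pattern.search(h).group("domain") for h in hosts}
for host, domain in results.items():
    print(host, "->", domain)
```

Nothing in the pattern tells the capture where to start, which is why only the values without a sub-domain come out right.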

I have played around with the start of line character, and greedy & non-greedy operators, but just can't get it to do what I want. Can a regex kung-fu master please lend some assistance?

Many thanks in advance!

stevelangdon · ‎07-11-2012

(spam removed)

Ayn · ‎07-11-2012

Malicious or not, it's certainly not in any way related to Splunk or this specific question either for that matter. Spam.

stevelangdon · ‎07-11-2012

I wish the moderators could see this conversation. You cannot simply state that any site is malicious. That can have such a negative effect on my online reputation.

stevelangdon · ‎07-11-2012

That offends me. I can assure you it is not malicious, and is my own website, built through months of hard work. I'm not running away anywhere, and it was not automated. Typed it with my own hands.

rturk · ‎07-11-2012

Don't click. Likely to be malicious website. Already reported.

tollops · ‎08-30-2011

something like this should help:

rex field=dest_host "^(\w|-)+\.(?<domain>.*)"
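
To see what this pattern captures, here is a small sketch using Python's `re` module in place of rex (an assumption for illustration; Python writes the named group as `(?P<domain>...)`). The helper name `after_first_label` is made up for this example:

```python
import re

# The pattern above in Python syntax: consume the first label and its
# dot, then capture everything that follows.
pattern = re.compile(r"^(\w|-)+\.(?P<domain>.*)")

def after_first_label(host):
    """Return whatever follows the first dot-separated label, or None."""
    m = pattern.search(host)
    return m.group("domain") if m else None

for host in ["www.domain-1.com", "ems.domain-2.com", "domain-1.com"]:
    print(host, "->", after_first_label(host))
```

The capture keeps the top-level domain, and for a value with no sub-domain it keeps only the top-level domain, so some further trimming would still be needed to get the bare domain name.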

rturk · ‎07-21-2011

Courtesy of the URI RFC (RFC 3986, Appendix B), here's a regex that will extract ALL relevant info from a URL field (single line):

((?<cs_uri_scheme>[^:/?#]+):)?(//(?<cs_uri_authority>[^/?#]*))?(?<cs_uri_stem>[^?#\s]*)(\?(?<cs_uri_query>[^#\s]*))?(#(?<cs_uri_fragment>[^\s]*))?

Preceded with the following, you can target field number 'X' in a space-delimited event. The quantifier is the number of fields to skip, i.e. X-1 (in this example the URL is field #4, so 3 fields are skipped):

(?i)^(?:[^\s]* ){3}

So given the event (single line):

123.123.123.123 2011-07-05 15:20:00 http://example.com/over/there/index.dtb?type=animal&name=narwhal#nose HTTP cht-cdn220-is-3 302

and the regex (single line):

(?i)^(?:[^\s]* ){3}((?<cs_uri_scheme>[^:/?#]+):)?(//(?<cs_uri_authority>[^/?#]*))?(?<cs_uri_stem>[^?#\s]*)(\?(?<cs_uri_query>[^#\s]*))?(#(?<cs_uri_fragment>[^\s]*))?

The following fields will be extracted via regex:

cs_uri_scheme:	http
cs_uri_authority:	example.com
cs_uri_stem:	/over/there/index.dtb
cs_uri_query:	type=animal&name=narwhal
cs_uri_fragment:	nose

I hope this is of some assistance to someone 🙂
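
For anyone who wants to try the URL breakdown outside Splunk, here is a rough Python sketch of the same idea. It is an adaptation, not the exact regex from this post: Python needs `(?P<name>...)` for named groups, this version skips three fields so that the fourth one (the URL) is parsed, and every character class excludes whitespace so no capture can run past the URL field. The names `URL_FIELD` and `fields` are made up for the example.

```python
import re

URL_FIELD = re.compile(
    r"^(?:\S* ){3}"                            # skip three space-delimited fields
    r"(?:(?P<cs_uri_scheme>[^:/?#\s]+):)?"     # scheme, e.g. http
    r"(?://(?P<cs_uri_authority>[^/?#\s]*))?"  # authority (host and optional port)
    r"(?P<cs_uri_stem>[^?#\s]*)"               # path
    r"(?:\?(?P<cs_uri_query>[^#\s]*))?"        # query string without the "?"
    r"(?:#(?P<cs_uri_fragment>\S*))?"          # fragment without the "#"
)

event = (
    "123.123.123.123 2011-07-05 15:20:00 "
    "http://example.com/over/there/index.dtb?type=animal&name=narwhal#nose "
    "HTTP cht-cdn220-is-3 302"
)

fields = URL_FIELD.match(event).groupdict()
for name, value in fields.items():
    print(f"{name}: {value}")
```

Changing the `{3}` quantifier moves the parse to a different field, which is worth checking against a real event before relying on the extraction.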

David · ‎07-22-2011

Of course, that doesn't answer the part of your question about separating the example from the com, or the example from the co.uk, or any other random gTLDs. I think at this point that requires an enormous, enormous list of possible gTLDs.

rturk · ‎06-26-2011

Thanks to Ziegfried, this got me to where I needed:

"(?<domain>[^\.]+)\.com"

I'll play around with it a bit more and post my results 🙂

ziegfried · ‎06-26-2011

rex field=dest_host "(?<domain>[^\.]+)\.([^\.]+)$"
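
A quick way to check this pattern against the sample values from the question, sketched with Python's `re` module standing in for rex (an assumption for illustration; the named group becomes `(?P<domain>...)`, and `extract_domain` is a made-up helper name):

```python
import re

# Anchoring at the end of the value is what fixes the extraction:
# the final dot-free run is the TLD (plus any :port), and the
# dot-free run just before it is the domain name.
pattern = re.compile(r"(?P<domain>[^.]+)\.([^.]+)$")

def extract_domain(dest_host):
    """Return the second-to-last dot-separated label, or None."""
    m = pattern.search(dest_host)
    return m.group("domain") if m else None

samples = [
    "www.domain-1.com",
    "tiles.domain-1.com",
    "www.domain-1.com:80",
    "domain-1.com",
    "domain-2.com",
    "ems.domain-2.com",
]
for host in samples:
    print(host, "->", extract_domain(host))
```

Because the last group only excludes dots, a trailing `:80` is absorbed along with the TLD, so the port case is handled without extra work.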

rturk · ‎06-26-2011

Thanks Ziegfried... although I just realised I left out the case where there is a country code at the end, e.g.:

www.domain-1.com.au

Sorry about that 😛
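
One possible way to cope with country-code endings is to check the tail of the host against a list of known multi-label suffixes before picking the domain label. The sketch below is in Python rather than rex and rests on loud assumptions: `MULTI_LABEL_SUFFIXES` is a tiny hypothetical set invented for this example, and `registered_label` is a made-up helper name. A real deployment would need a far larger list (the Public Suffix List is one such source, not something used in this thread).

```python
# Hypothetical, deliberately tiny suffix set, for illustration only.
MULTI_LABEL_SUFFIXES = {"com.au", "co.uk"}

def registered_label(dest_host: str) -> str:
    """Return the label just left of the suffix, or "" if there is none."""
    host = dest_host.split(":", 1)[0].lower()  # drop an optional :port
    labels = host.split(".")
    # Treat the last two labels as the suffix when they form a known
    # multi-label suffix; otherwise only the last label is the suffix.
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    if len(labels) <= suffix_len:
        return ""  # the value is a bare suffix or a single label
    return labels[-suffix_len - 1]

for host in ["www.domain-1.com.au", "www.domain-1.com:80", "domain-2.com"]:
    print(host, "->", registered_label(host))
```

The result is only as good as the suffix list: any country-code ending missing from the set falls back to the single-label rule and yields the wrong label.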
