Splunk Search

Regex: Issue with domain name extraction from URL field

rturk
Builder

Greetings Splunkers!

I posed this question in the IRC channel, but thought I'd put it in here as well just in case anyone has a similar issue.

I have a "dest_host" field which contains host values taken from URLs, e.g.:

www.domain-1.com
tiles.domain-1.com
www.domain-1.com:80
domain-1.com
domain-2.com
ems.domain-2.com
etc...

Essentially, the format is:

<optional-sub-domain>.<domain-name>.<top-level-domain>

I wish to create a regex that extracts only the domain-name value. However, my efforts have been complicated by the values that do not have a sub-domain specified.

Note that I am not overly concerned with values with the port number after them as they are statistically insignificant (but if there's an easy way to include them, that's fine).

The rex I have put together for this is:

rex field=dest_host "(?<domain>.*).com"

But given the above values, this extracts the following:

dest_host, domain
www.domain-1.com, www.domain-1 (incorrect)
tiles.domain-1.com, tiles.domain-1 (incorrect)
www.domain-1.com:80, www.domain-1 (incorrect)
domain-1.com, domain-1 (correct)
domain-2.com, domain-2 (correct)
ems.domain-2.com, ems.domain-2 (incorrect)
etc...

I have played around with the start of line character, and greedy & non-greedy operators, but just can't get it to do what I want. Can a regex kung-fu master please lend some assistance?

Many thanks in advance!

tollops
Explorer

something like this should help:

rex field=dest_host "^(\w|-)+\.(?<domain>.*)"
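
For anyone who wants to check this one outside Splunk, here is a rough Python sketch. It assumes Python's re module, which needs the (?P<name>...) spelling for named groups, and the helper name strip_first_label is made up for illustration. As far as I can tell the pattern drops the first label rather than isolating the domain name, so values without a sub-domain come out as just the TLD.

```python
import re

# tollops's pattern, with the named group in Python's (?P<name>...) form.
PATTERN = re.compile(r"^(\w|-)+\.(?P<domain>.*)")

def strip_first_label(host):
    """Return whatever follows the first dot-terminated label, or None."""
    m = PATTERN.search(host)
    return m.group("domain") if m else None

# Sample values from the question.
for host in ["www.domain-1.com", "tiles.domain-1.com", "domain-1.com"]:
    print(host, "->", strip_first_label(host))
```

With a sub-domain present the result still carries the .com, and a bare domain-1.com is reduced to com, so it probably needs combining with another step.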

rturk
Builder

Courtesy of the RFC, here's a regex that will extract all the relevant parts of a URL field (single line):

((?<cs_uri_scheme>[^:/?#]+):)?(//(?<cs_uri_authority>[^/?#]*))?(?<cs_uri_stem>[^?#|\s]*)(\?(?<cs_uri_query>[^#|^\s]*))?(#(?<cs_uri_fragment>[^\s]*))?

Prefix it with the following to skip the leading fields of a space-delimited event. The number in braces is how many fields to skip, so {3} starts matching at field #4:

(?i)^(?:[^\s]* ){3}

So given the event (single line):

123.123.123.123 2011-07-05 15:20:00 http://example.com/over/there/index.dtb?type=animal&name=narwhal#nose HTTP cht-cdn220-is-3 302

and the regex (single line):

(?i)^(?:[^\s]* ){3}((?<cs_uri_scheme>[^:/?#]+):)?(//(?<cs_uri_authority>[^/?#]*))?(?<cs_uri_stem>[^?#|\s]*)(\?(?<cs_uri_query>[^#|^\s]*))?(#(?<cs_uri_fragment>[^\s]*))?

The following fields will be extracted via regex:

cs_uri_scheme:http
cs_uri_authority:example.com
cs_uri_stem:/over/there/index.dtb
cs_uri_query:type=animal&name=narwhal
cs_uri_fragment:nose

I hope this is of some assistance to someone 🙂
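
If you want to try the URL pattern outside Splunk, here is a minimal Python sketch of the same idea. The assumptions are mine: Python's re module needs (?P<name>...) for named groups, the prefix skips three leading fields so matching starts at the fourth, and the fragment is limited to non-whitespace so it stops at the end of the URL. The name URL_FIELDS is only for illustration.

```python
import re

# Skip three space-delimited fields, then split the fourth into URL parts.
URL_FIELDS = re.compile(
    r"(?i)^(?:[^\s]* ){3}"
    r"((?P<cs_uri_scheme>[^:/?#]+):)?"
    r"(//(?P<cs_uri_authority>[^/?#]*))?"
    r"(?P<cs_uri_stem>[^?#|\s]*)"
    r"(\?(?P<cs_uri_query>[^#|^\s]*))?"
    r"(#(?P<cs_uri_fragment>[^\s]*))?"
)

event = (
    "123.123.123.123 2011-07-05 15:20:00 "
    "http://example.com/over/there/index.dtb?type=animal&name=narwhal#nose "
    "HTTP cht-cdn220-is-3 302"
)

match = URL_FIELDS.match(event)
fields = match.groupdict() if match else {}
for name, value in fields.items():
    print(f"{name}:{value}")
```

Changing the {3} to another number moves the match to a different field, which is handy when the URL sits in a different column.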

David
Splunk Employee

Of course, that doesn't answer the part of your question for separating the example from the com, or the example from the co.uk, or any other random gtlds. I think at this point, that requires an enormous, enormous list of possible gtlds.
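
To make the list idea concrete, here is a small Python sketch of a suffix lookup. The KNOWN_SUFFIXES set is a made-up, deliberately tiny stand-in for a real list, and registered_label is a hypothetical helper name. It only shows the shape of the approach, not a complete solution.

```python
# Hypothetical, deliberately tiny suffix list; a real one would be far longer.
KNOWN_SUFFIXES = {"com", "net", "org", "co.uk", "com.au"}

def registered_label(host):
    """Return the label just left of the longest known suffix, or None."""
    host = host.split(":", 1)[0].lower()  # drop an optional :port
    labels = host.split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so "com.au" is preferred over a shorter match.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            return labels[i - 1]
    return None

for host in ["www.domain-1.com:80", "www.domain-1.com.au", "example.co.uk"]:
    print(host, "->", registered_label(host))
```

Anything whose suffix is not in the set returns None, which at least makes the gaps in the list easy to spot.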

rturk
Builder

Thanks to Ziegfried, this got me to where I needed:

"(?<domain>[^\.]+)\.com"

I'll play around with it a bit more and post my results 🙂

ziegfried
Influencer

rex field=dest_host "(?<domain>[^\.]+)\.([^\.]+)$"
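
A quick way to sanity-check this pattern against the sample values from the question is a short Python sketch. It assumes Python's re module, so the named group is written (?P<domain>...), and extract_domain is just an illustrative name.

```python
import re

# Second-to-last dot-separated label, anchored at the end of the value.
PATTERN = re.compile(r"(?P<domain>[^\.]+)\.([^\.]+)$")

def extract_domain(host):
    """Return the label before the final dot-separated part, or None."""
    m = PATTERN.search(host)
    return m.group("domain") if m else None

samples = [
    "www.domain-1.com",
    "tiles.domain-1.com",
    "www.domain-1.com:80",
    "domain-1.com",
    "domain-2.com",
    "ems.domain-2.com",
]
for host in samples:
    print(host, "->", extract_domain(host))
```

The :80 case works because the port has no dot in it, so it is swallowed by the last group along with com.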

rturk
Builder

Thanks Ziegfried... although I just realised I left out the case where there is a country code at the end, e.g.:

www.domain-1.com.au

Sorry about that 😛
