Greetings Splunkers!
I posed this question in the IRC channel, but thought I'd put it in here as well just in case anyone has a similar issue.
I have a "dest_host" field which contains hostname values, e.g.:
www.domain-1.com
tiles.domain-1.com
www.domain-1.com:80
domain-1.com
domain-2.com
ems.domain-2.com
etc...
Essentially, the format is:
<optional-sub-domain>.<domain-name>.<top-level-domain>
I wish to create a regex that extracts only the domain-name value; however, my efforts have been complicated by the values that do not have a sub-domain specified.
Note that I am not overly concerned with values with the port number after them as they are statistically insignificant (but if there's an easy way to include them, that's fine).
The rex I have put together for this is:
rex field=dest_host "(?<domain>.*).com"
But given the above values, this extracts the following:
dest_host, domain
www.domain-1.com, www.domain-1 (incorrect)
tiles.domain-1.com, tiles.domain-1 (incorrect)
www.domain-1.com:80, www.domain-1 (incorrect)
domain-1.com, domain-1 (correct)
domain-2.com, domain-2 (correct)
ems.domain-2.com, ems.domain-2 (incorrect)
etc...
I have played around with the start of line character, and greedy & non-greedy operators, but just can't get it to do what I want. Can a regex kung-fu master please lend some assistance?
Many thanks in advance!
something like this should help:
rex field=dest_host "^(\w|-)+\.(?<domain>.*)"
Adapted from the URI-parsing regex in RFC 3986 (Appendix B), here's a regex that will extract all the relevant parts of a URL field (single line):
((?<cs_uri_scheme>[^:/?#]+):)?(//(?<cs_uri_authority>[^/?#]*))?(?<cs_uri_stem>[^?#|\s]*)(\?(?<cs_uri_query>[^#|^\s]*))?(#(?<cs_uri_fragment>[^\s]*))?
Prefix it with the following to skip the first N fields of a space-delimited event, so that the match starts at field N+1 (in this example N is 3, which targets field #4):
(?i)^(?:[^\s]* ){3}
So given the event (single line):
123.123.123.123 2011-07-05 15:20:00 http://example.com/over/there/index.dtb?type=animal&name=narwhal#nose HTTP cht-cdn220-is-3 302
and the regex (single line):
(?i)^(?:[^\s]* ){3}((?<cs_uri_scheme>[^:/?#]+):)?(//(?<cs_uri_authority>[^/?#]*))?(?<cs_uri_stem>[^?#|\s]*)(\?(?<cs_uri_query>[^#|^\s]*))?(#(?<cs_uri_fragment>[^\s]*))?
The following fields will be extracted via regex:
cs_uri_scheme: | http |
cs_uri_authority: | example.com |
cs_uri_stem: | /over/there/index.dtb |
cs_uri_query: | type=animal&name=narwhal |
cs_uri_fragment: | nose |
I hope this is of some assistance to someone 🙂
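For anyone who wants to try the idea outside of Splunk, here is a rough Python sketch of the same extraction. It is an adaptation rather than the exact pattern above: Python spells named groups as (?P<name>...), the sketch skips three leading fields so that matching starts at the fourth one, and it bounds the fragment with \S* so that it stops at the next space. The names URI_RE, event and fields are just illustrative.

```python
import re

# Python writes named groups as (?P<name>...); Splunk's rex uses (?<name>...).
URI_RE = re.compile(
    r"^(?:\S* ){3}"                          # skip the first three space-delimited fields
    r"(?:(?P<cs_uri_scheme>[^:/?#]+):)?"     # scheme, e.g. http
    r"(?://(?P<cs_uri_authority>[^/?#]*))?"  # authority (host and optional port)
    r"(?P<cs_uri_stem>[^?#\s]*)"             # path
    r"(?:\?(?P<cs_uri_query>[^#\s]*))?"      # query string without the leading ?
    r"(?:#(?P<cs_uri_fragment>\S*))?"        # fragment, up to the next whitespace
)

event = (
    "123.123.123.123 2011-07-05 15:20:00 "
    "http://example.com/over/there/index.dtb?type=animal&name=narwhal#nose "
    "HTTP cht-cdn220-is-3 302"
)

fields = URI_RE.match(event).groupdict()
for name, value in fields.items():
    print(f"{name}: {value}")
```

Running it against the sample event should print the five fields listed above, one per line.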
Of course, that doesn't answer the part of your question about separating the example from the com, or the example from the co.uk, or from any other TLD. I think at that point it requires an enormous list of possible TLDs and country-code suffixes.
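To make the "list of suffixes" idea concrete, here is a minimal Python sketch. KNOWN_SUFFIXES is a hypothetical, deliberately tiny stand-in for the real (very long) list, and registered_name is just an illustrative helper name. The approach is simply to find the longest known suffix and take the label immediately before it.

```python
# Hypothetical, deliberately tiny suffix list; a real one would be far longer.
KNOWN_SUFFIXES = {"com", "co.uk", "com.au"}

def registered_name(host):
    """Return the label just before the longest known suffix, or None."""
    host = host.split(":", 1)[0].lower()   # drop an optional :port
    labels = host.split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so "co.uk" is tried before "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            return labels[i - 1]
    return None

for host in ["www.domain-1.com:80", "domain-2.com", "www.example.co.uk"]:
    print(host, "->", registered_name(host))
```

With this toy list, www.example.co.uk yields example rather than co, which is the case a plain "second-to-last label" regex gets wrong.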
Thanks to Ziegfried, this got me to where I needed:
"(?<domain>[^\.]+)\.com"
I'll play around with it a bit more and post my results 🙂
rex field=dest_host "(?<domain>[^\.]+)\.([^\.]+)$"
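A quick way to sanity-check that last pattern is to run it over the sample values from the question. The sketch below does so in Python (named groups are written (?P<name>...) there); LAST_TWO and extract are illustrative names, and www.example.co.uk is a made-up value added to show the country-code case.

```python
import re

# Same idea as the rex above: capture the label before the final dot-separated part.
LAST_TWO = re.compile(r"(?P<domain>[^.]+)\.([^.]+)$")

def extract(host):
    m = LAST_TWO.search(host)
    return m.group("domain") if m else None

samples = [
    "www.domain-1.com",
    "tiles.domain-1.com",
    "www.domain-1.com:80",
    "domain-1.com",
    "domain-2.com",
    "ems.domain-2.com",
    "www.example.co.uk",   # made-up country-code example
]

for host in samples:
    print(f"{host}, {extract(host)}")
```

The six original values all come out as domain-1 or domain-2 (the :80 value works because the port contains no dot), while the country-code value comes out as co, so that case still needs separate handling.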
Thanks Ziegfried... although I just realised I left out the option where there is a country code at the end, eg:
Sorry about that 😛