This may not be the best place to ask, given that my issue isn't technically Splunk-related, but hopefully I can get some help from people smarter than me anyway.
(?i)(?P<scheme>(?:http|ftp|hxxp)s?(?:://|-3A__|%3A%2F%2F))?(?:%[\da-f][\da-f])?(?P<domain>(?:[\p{L}\d\-–]+(?:\.|\[\.\]))+[\p{L}]{2,})(@|%40)?(?:\b| |[[:punct:]]|$)
The above regex is a template I'm working from (lol, I'm not nearly good enough to write this). It isn't too hard to read: in a nutshell, it matches the domain of a URL and nothing else. It first looks for an optional leading scheme such as 'https://' and stores it in the 'scheme' group, then parses the domain that follows. For example, the URL 'https://community.splunk.com/t5/forums/postpage/board-id/splunk-search' would match 'community.splunk.com'.
My issue is that the way it looks for domains after the 'scheme' group requires a TLD (.com, .net, etc.). Unfortunately, internal services used by my company don't use a TLD, so the regex doesn't catch them. I need to change it so that it does.
I want to modify the regex above to match URLs like 'https://mysite/resources/rules/123456', where the domain would be 'mysite'. I've attempted to do so, but with my limited understanding of how regex really works, my attempts lead to too many matches, as shown below.
(?i)(?P<scheme>(?:http|ftp|hxxp)s?(?::\/\/|-3A__|%3A%2F%2F))?(?:%[\da-f][\da-f])?(?P<domain>((?:[\p{L}\d\-–]+(?:\.|\[\.\]))+)?[\p{L}]{2,})(@|%40)?(?:\b| |[[:punct:]]|$)
I tried to throw in an extra non-capturing group within the named 'domain' group and make the entire first half of the 'domain' group optional, but it leads to matches beyond the domain.
Thank you to whoever may be able to assist. This doesn't feel like it should be such a difficult thing, but it's been vexing me for hours.
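One likely reason for the extra matches: the 'scheme' group is itself optional, so once the dotted prefix is optional too, the 'domain' group can be satisfied by any run of two or more letters, including path segments. Below is a minimal Python sketch of that effect. It assumes an ASCII-only simplification of the pattern (`[a-z]` stands in for `\p{L}` and `[^\w\s]` for `[[:punct:]]`, since Python's `re` module supports neither), and all variable names are hypothetical.

```python
import re

# Simplified, ASCII-only pieces of the modified pattern (an approximation).
SCHEME = r"(?P<scheme>(?:http|ftp|hxxp)s?(?:://|-3A__|%3A%2F%2F))"
PREFIX = r"(?:%[\da-f][\da-f])?"
# The dotted prefix is optional, so a bare run of letters is enough.
DOMAIN = r"(?P<domain>(?:(?:[a-z\d\-–]+(?:\.|\[\.\]))+)?[a-z]{2,})"
TAIL = r"(?:@|%40)?(?:\b| |[^\w\s]|$)"

# Variant 1: scheme optional, as in the modified pattern.
optional_scheme = re.compile(SCHEME + "?" + PREFIX + DOMAIN + TAIL, re.IGNORECASE)
# Variant 2: identical, except that the scheme is required.
required_scheme = re.compile(SCHEME + PREFIX + DOMAIN + TAIL, re.IGNORECASE)


def domains(pattern, text):
    """Return the 'domain' group of every match found in text."""
    return [m.group("domain") for m in pattern.finditer(text)]


url = "https://mysite/resources/rules/123456"
print(domains(optional_scheme, url))  # ['mysite', 'resources', 'rules']
print(domains(required_scheme, url))  # ['mysite']
```

Requiring the scheme is only one way to anchor the match; it would no longer catch bare domains that appear without a scheme.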
Is there any specific reason why you must use your own regex to extract the domain? There are much more mature and robust options, including Splunk's built-in transforms, e.g., url:
| extract url
| rex field=domain "(?<domain>.+)(?::(?<port>\d+))$"
| rename proto as scheme
(The url transform results in a field named "domain" that contains both domain and port. This is why I add a second extraction to separate port from domain. It also gives a field "proto" which you call scheme.)
Here is some mock data for you to play with and compare with real data:
| makeresults format=csv data="_raw
http://www.google.com/search?q=what%20about%20bob
https://yahoo.com:443/
ftp://localhost:23/
ssh://1234:abcd:::21/"
``` data emulation above ```
They should give:
_raw | domain | port | q | scheme | uri | url |
http://www.google.com/search?q=what%20about%20bob | www.google.com | | what%20about%20bob | http | /search?q=what%20about%20bob | http://www.google.com/search?q=what%20about%20bob |
https://yahoo.com:443/ | yahoo.com | 443 | | https | / | https://yahoo.com:443/ |
ftp://localhost:23/ | localhost | 23 | | ftp | / | ftp://localhost:23/ |
ssh://1234:abcd:::21/ | 1234:abcd:: | 21 | | ssh | / | ssh://1234:abcd:::21/ |
Without going into verbose detail, it isn't Splunk which is doing the domain extraction, hence I need to rely on regex.
Understood. When I say to use Splunk's mature/robust solution, I don't mean it has to happen inside Splunk. All you need is to use the regex that Splunk has QA tested for you. The regex in Splunk's url transform is this:
(?<url>[[alphas:proto]]://(?<domain>[a-zA-Z0-9\-.:]++)(?<uri>/[^\s"]*)?)
Here is the same test, except I substitute the transform with the above regex.
| makeresults format=csv data="_raw
http://www.google.com/search?q=what%20about%20bob
https://yahoo.com:443/
ftp://localhost:23/
ssh://1234:abcd:::21/"
| rex "(?<url>[[alphas:proto]]://(?<domain>[a-zA-Z0-9\-.:]++)(?<uri>/[^\s\"]*)?)"
| rex field=domain "(?<domain>.+)(?::(?<port>\d+))$"
| rename proto as scheme
(Because the rex command requires the pattern to be wrapped in double quotes, I have to escape the double quote inside the uri group.) It gives the same valid results that you want:
_raw | domain | port | scheme | uri | url |
http://www.google.com/search?q=what%20about%20bob | www.google.com | | http | /search?q=what%20about%20bob | http://www.google.com/search?q=what%20about%20bob |
https://yahoo.com:443/ | yahoo.com | 443 | https | / | https://yahoo.com:443/ |
ftp://localhost:23/ | localhost | 23 | ftp | / | ftp://localhost:23/ |
ssh://1234:abcd:::21/ | 1234:abcd:: | 21 | ssh | / | ssh://1234:abcd:::21/ |
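For use outside Splunk, the same two-step extraction could be sketched in Python roughly as follows. Several things here are assumptions rather than Splunk's own code: `[[alphas:proto]]` is Splunk's modular-regex syntax and is expanded by hand to a named group `[a-zA-Z]+`; named groups are rewritten in Python's `(?P<name>...)` form; the possessive `++` is written as a plain `+` so it also runs on Python versions older than 3.11; and `parse_url` is a hypothetical helper name.

```python
import re

# Hand-translated version of the url transform regex (see assumptions above).
URL_RE = re.compile(
    r'(?P<url>(?P<proto>[a-zA-Z]+)://(?P<domain>[a-zA-Z0-9\-.:]+)(?P<uri>/[^\s"]*)?)'
)
# Second step: split a trailing :port off the domain, if there is one.
PORT_RE = re.compile(r"(?P<domain>.+):(?P<port>\d+)$")


def parse_url(text):
    """Return scheme, domain, port, uri and url for the first URL in text, or None."""
    m = URL_RE.search(text)
    if m is None:
        return None
    fields = {
        "scheme": m["proto"],
        "domain": m["domain"],
        "port": None,
        "uri": m["uri"],
        "url": m["url"],
    }
    pm = PORT_RE.match(fields["domain"])
    if pm:
        fields["domain"] = pm["domain"]
        fields["port"] = pm["port"]
    return fields


for sample in (
    "http://www.google.com/search?q=what%20about%20bob",
    "https://yahoo.com:443/",
    "ftp://localhost:23/",
    "ssh://1234:abcd:::21/",
    "https://mysite/resources/rules/123456",
):
    print(parse_url(sample))
```

Because the domain character class has no TLD requirement, a single-label host such as 'mysite' is extracted the same way as 'localhost'.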
You can use the one below:
(?i)(?:https?|ftp|hxxp)s?:\/\/(?:www\.)?(?P<domain>[a-zA-Z0-9\-\.]+)
Regards,
Prewin
Hi @dtaylor
You could try with the following:
(?i)(?P<scheme>(?:http|ftp|hxxp)s?(?::\/\/|-3A__|%3A%2F%2F))?(?:%[\da-f][\da-f])?(?P<domain>[\p{L}\d\–]+(?:\.[\p{L}\d\–]+)*)(@|%40)?(?:\b| |[[:punct:]]|$)
Thank you for offering to help me. I tested your example regex, and as in your screenshot, it looks like I'm getting a lot more matches for the domain group than just the domain. I need the domain group to match only the domain of a URL. I apologize if this wasn't clear.
The end goal of this is to use the expression in an automation which takes in URLs, parses the domain, performs a DNS lookup on each domain, and judges whether a domain is hosted locally based on the IP.
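The lookup step described above might be sketched in Python as follows. This is only an illustration: it assumes "hosted locally" means the resolved address falls in a private range, and `is_locally_hosted`, `fake_resolve`, and the hostname-to-IP mapping are all made up for the example.

```python
import ipaddress
import socket
from typing import Callable, Optional


def is_locally_hosted(
    domain: str, resolve: Callable[[str], str] = socket.gethostbyname
) -> Optional[bool]:
    """True if the domain resolves to a private IP, False if public, None if unresolved."""
    try:
        ip = ipaddress.ip_address(resolve(domain))
    except (OSError, ValueError):
        # Resolution failed, or the resolver returned something that is not an IP.
        return None
    return ip.is_private


# Made-up mapping so the example runs without real DNS.
FAKE_DNS = {"mysite": "10.1.2.3", "public.example": "8.8.8.8"}


def fake_resolve(host: str) -> str:
    """Stand-in resolver that fails the way a real lookup would (OSError)."""
    if host not in FAKE_DNS:
        raise OSError(f"cannot resolve {host}")
    return FAKE_DNS[host]


for host in ("mysite", "public.example", "missing.example"):
    print(host, is_locally_hosted(host, fake_resolve))
```

With the default resolver, `socket.gethostbyname`, only IPv4 addresses are returned, so an IPv6-only host would need a different lookup call.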
My apologies @dtaylor
Not had my morning coffee yet.. how about this?
(?:http|ftp|hxxp)s?:\/\/([\p{L}\d-]+(?:\.[\p{L}\d-]+)*)
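For anyone testing this outside a PCRE engine, here is a rough Python sketch of the same pattern. Python's `re` module has no `\p{L}`, so the sketch assumes `[^\W_]` (any Unicode letter or digit) is a close enough stand-in for `[\p{L}\d]`; `LABEL`, `DOMAIN_RE`, and `extract_domain` are hypothetical names.

```python
import re

# One hostname label: letters, digits, or hyphens (approximation of [\p{L}\d-]+).
LABEL = r"(?:[^\W_]|-)+"
# The scheme is required; the domain is one label plus any number of ".label" parts.
DOMAIN_RE = re.compile(r"(?:http|ftp|hxxp)s?://(" + LABEL + r"(?:\." + LABEL + r")*)")


def extract_domain(text):
    """Return the domain of the first URL in text, or None if there is no URL."""
    m = DOMAIN_RE.search(text)
    return m.group(1) if m else None


print(extract_domain("https://mysite/resources/rules/123456"))  # mysite
print(extract_domain("https://community.splunk.com/t5/forums/postpage"))  # community.splunk.com
print(extract_domain("ftp://localhost:23/"))  # localhost
```

Because the scheme is mandatory here, path segments such as 'resources' can no longer start a match, and the port is left out since ':' is not part of a label.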