Solved: Re: Convert Regex to Detect Internal Domains

dtaylor · ‎07-05-2025

This may not be the best place to ask, given my issue isn't technically Splunk-related, but hopefully I can get some help from people smarter than me anyway.

(?i)(?P<scheme>(?:http|ftp|hxxp)s?(?:://|-3A__|%3A%2F%2F))?(?:%[\da-f][\da-f])?(?P<domain>(?:[\p{L}\d\-–]+(?:\.|\[\.\]))+[\p{L}]{2,})(@|%40)?(?:\b| |[[:punct:]]|$)

The above regex is a template I'm working from (lol, I'm not nearly good enough to write this). It's not too hard to read: in a nutshell, it matches the domain of a URL and nothing else. It does this by first looking for an optional leading scheme such as 'https://' and storing it in the 'scheme' group. It then parses the domain that follows. For example, for the URL 'https://community.splunk.com/t5/forums/postpage/board-id/splunk-search' it would match 'community.splunk.com'.

My issue is that the way it looks for domains following the 'scheme' group requires a TLD (.com, .net, etc.). Unfortunately, the internal services used by my company don't use a TLD, so the regex doesn't catch them. I need to change it so that it does.

I want to modify the regex above to match URLs like 'https://mysite/resources/rules/123456', where the domain would be 'mysite'. I've attempted to do so, but with my limited understanding of how regex really works, my attempts lead to too many matches, as shown below.

(?i)(?P<scheme>(?:http|ftp|hxxp)s?(?::\/\/|-3A__|%3A%2F%2F))?(?:%[\da-f][\da-f])?(?P<domain>((?:[\p{L}\d\-–]+(?:\.|\[\.\]))+)?[\p{L}]{2,})(@|%40)?(?:\b| |[[:punct:]]|$)

I tried to wrap the first half of the named 'domain' group in an extra group and make that whole first half optional, but it leads to matches beyond the domain.
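To see why the modified pattern over-matches, here is a minimal Python sketch. It is an ASCII-only simplification rather than the exact pattern: Python's `re` module has no `\p{L}` or `[[:punct:]]`, so `[a-z]` and `[^\w\s]` stand in for them, and the percent-encoding and `@` groups are left out. The names `TEMPLATE`, `ATTEMPT`, and `domains` are made up for illustration.

```python
import re

# One label plus its trailing dot (or defanged "[.]"); ASCII stand-in for [\p{L}\d\-–]+
LABEL = r"[a-z\d\-]+(?:\.|\[\.\])"
SCHEME = r"(?i)(?P<scheme>(?:http|ftp|hxxp)s?(?:://|-3A__|%3A%2F%2F))?"
TAIL = r"(?:\b| |[^\w\s]|$)"  # stand-in for (?:\b| |[[:punct:]]|$)

# Original shape: at least one "label." is required before the final letters.
TEMPLATE = re.compile(SCHEME + r"(?P<domain>(?:" + LABEL + r")+[a-z]{2,})" + TAIL)

# Modified shape: the dotted prefix is optional, so bare letters qualify.
ATTEMPT = re.compile(SCHEME + r"(?P<domain>((?:" + LABEL + r")+)?[a-z]{2,})" + TAIL)


def domains(pattern, text):
    """Return every value captured by the 'domain' group in text."""
    return [m.group("domain") for m in pattern.finditer(text)]


url = "https://mysite/resources/rules/123456"
print(domains(TEMPLATE, url))  # no dot anywhere, so nothing matches
print(domains(ATTEMPT, url))   # the host matches, but so do path segments
```

Because both the scheme and the dotted prefix are optional in the modified pattern, any run of two or more letters can start a new match. That is why path segments such as 'resources' and 'rules' show up alongside 'mysite'.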

Thank you to whoever may be able to assist. This doesn't feel like it should be such a difficult thing, but it's been vexing me for hours.

yuanliu · ‎07-11-2025

Understood. When I say to use Splunk's mature/robust solution, it doesn't mean it has to happen inside Splunk. All you need is to use the regex that Splunk has QA tested for you. The regex in Splunk's transformation url is this:

(?<url>[[alphas:proto]]://(?<domain>[a-zA-Z0-9\-.:]++)(?<uri>/[^\s"]*)?)

Here is the same test, except I substitute the transform with the above regex.

| makeresults format=csv data="_raw
http://www.google.com/search?q=what%20about%20bob
https://yahoo.com:443/
ftp://localhost:23/
ssh://1234:abcd:::21/"

| rex "(?<url>[[alphas:proto]]://(?<domain>[a-zA-Z0-9\-.:]++)(?<uri>/[^\s\"]*)?)"
| rex field=domain "(?<domain>.+)(?::(?<port>\d+))$"
| rename proto as scheme

(Because the rex command requires the regex to be enclosed in double quotes, I have to escape the double quote inside the uri group.) It gives the exact same valid results that you want:

_raw	domain	port	scheme	uri	url
http://www.google.com/search?q=what%20about%20bob	www.google.com		http	/search?q=what%20about%20bob	http://www.google.com/search?q=what%20about%20bob
https://yahoo.com:443/	yahoo.com	443	https	/	https://yahoo.com:443/
ftp://localhost:23/	localhost	23	ftp	/	ftp://localhost:23/
ssh://1234:abcd:::21/	1234:abcd::	21	ssh	/	ssh://1234:abcd:::21/
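Since the extraction has to run outside Splunk, the same two-step logic can be sketched in Python. This is a port under stated assumptions, not Splunk's implementation: `[[alphas:proto]]` is Splunk modular-regex syntax, and plain ASCII letters (`[a-zA-Z]+`) are assumed here as its equivalent; the possessive `++` is replaced with a greedy `+` so the pattern also compiles on Python versions before 3.11. The function name `parse_url` is hypothetical.

```python
import re

# Assumed equivalent of Splunk's url transform regex (see lead-in for caveats).
URL_RE = re.compile(
    r'(?P<url>(?P<proto>[a-zA-Z]+)://(?P<domain>[a-zA-Z0-9\-.:]+)(?P<uri>/[^\s"]*)?)'
)
# Second step from the thread: split a trailing ":<digits>" off as the port.
PORT_RE = re.compile(r"(?P<domain>.+):(?P<port>\d+)$")


def parse_url(text):
    """Return scheme, domain, port, and uri for the first URL in text, or None."""
    m = URL_RE.search(text)
    if m is None:
        return None
    domain, port = m.group("domain"), None
    p = PORT_RE.search(domain)
    if p:
        domain, port = p.group("domain"), p.group("port")
    return {
        "scheme": m.group("proto"),
        "domain": domain,
        "port": port,
        "uri": m.group("uri"),
    }


for sample in (
    "http://www.google.com/search?q=what%20about%20bob",
    "https://yahoo.com:443/",
    "ftp://localhost:23/",
    "ssh://1234:abcd:::21/",
    "https://mysite/resources/rules/123456",
):
    print(parse_url(sample))
```

The last sample is the internal-style URL from the original question. Because this pattern does not require a TLD, its domain comes out as 'mysite'.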

yuanliu · ‎07-06-2025

Is there any specific reason why you must use your own regex to extract domain? There are much more mature/robust algorithms, including Splunk's built-in transforms, e.g., url:

| extract url
| rex field=domain "(?<domain>.+)(?::(?<port>\d+))$"
| rename proto as scheme

(The url transform results in a field named "domain" that contains both domain and port. This is why I add a second extraction to separate port from domain. It also gives a field "proto" which you call scheme.)

Here are some mock data for you to play with and compare with real data

| makeresults format=csv data="_raw
http://www.google.com/search?q=what%20about%20bob
https://yahoo.com:443/
ftp://localhost:23/
ssh://1234:abcd:::21/"
``` data emulation above ```

They should give

_raw	domain	port	q	scheme	uri	url
http://www.google.com/search?q=what%20about%20bob	www.google.com		what%20about%20bob	http	/search?q=what%20about%20bob	http://www.google.com/search?q=what%20about%20bob
https://yahoo.com:443/	yahoo.com	443		https	/	https://yahoo.com:443/
ftp://localhost:23/	localhost	23		ftp	/	ftp://localhost:23/
ssh://1234:abcd:::21/	1234:abcd::	21		ssh	/	ssh://1234:abcd:::21/

dtaylor · ‎07-11-2025

Without going into verbose detail, it isn't Splunk which is doing the domain extraction, hence I need to rely on regex.

PrewinThomas · ‎07-06-2025

@dtaylor

You can use the one below:

(?i)(?:https?|ftp|hxxp)s?:\/\/(?:www\.)?(?P<domain>[a-zA-Z0-9\-\.]+)
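This pattern happens to compile unchanged in Python's `re` module, so its behavior is easy to check. The sketch below is only a quick test harness; the name `extract_domain` is made up for illustration.

```python
import re

# The pattern from the post above, unchanged.
PATTERN = re.compile(r"(?i)(?:https?|ftp|hxxp)s?:\/\/(?:www\.)?(?P<domain>[a-zA-Z0-9\-\.]+)")


def extract_domain(text):
    """Return the 'domain' group of the first match, or None if nothing matches."""
    m = PATTERN.search(text)
    return m.group("domain") if m else None


print(extract_domain("https://mysite/resources/rules/123456"))
print(extract_domain("https://community.splunk.com/t5/forums/postpage"))
print(extract_domain("http://www.google.com/search"))
print(extract_domain("https://yahoo.com:443/"))
```

Three behaviors are worth knowing before feeding the result into a DNS lookup: a leading 'www.' is dropped from the captured domain, a port is excluded because ':' is not in the character class, and the scheme is mandatory, so bare hostnames and the defanged '[.]' or percent-encoded forms from the original template are not handled.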

Regards,
Prewin

livehybrid · ‎07-05-2025

Hi @dtaylor

You could try with the following:

(?i)(?P<scheme>(?:http|ftp|hxxp)s?(?::\/\/|-3A__|%3A%2F%2F))?(?:%[\da-f][\da-f])?(?P<domain>[\p{L}\d\–]+(?:\.[\p{L}\d\–]+)*)(@|%40)?(?:\b| |[[:punct:]]|$)

dtaylor · ‎07-05-2025

Thank you for offering to help me. I tested your example regex and, as in your screenshot, I'm getting a lot more matches for the domain group than just the domain. I need the domain group to match only the domain of a URL. I apologize if this wasn't clear.

The end goal is to use the expression in an automation which takes in URLs, parses the domain, performs a DNS lookup on each domain, and judges whether a domain is hosted locally based on the IP.
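The lookup-and-classify step described above could be sketched roughly as follows. This is a guess at the workflow, not the actual automation: it assumes "hosted locally" means every resolved address falls in a private range as defined by Python's `ipaddress.is_private`, which may not match a given company's network layout. The names `resolve` and `is_internal` are hypothetical, and the resolver is passed in as a parameter so the logic can be exercised without real DNS.

```python
import ipaddress
import socket


def resolve(host):
    """Resolve host with the system resolver; return a sorted list of IP strings."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    # sockaddr[0] is the address; drop any IPv6 zone suffix such as "%eth0".
    return sorted({info[4][0].split("%")[0] for info in infos})


def is_internal(host, resolver=resolve):
    """True if host resolves and every resolved address is private (assumed rule)."""
    ips = resolver(host)
    return bool(ips) and all(ipaddress.ip_address(ip).is_private for ip in ips)


# Usage with a stand-in resolver; the addresses below are made-up examples.
fake_dns = {"mysite": ["10.1.2.3"], "public.example": ["8.8.8.8"]}
lookup = lambda host: fake_dns.get(host, [])
print(is_internal("mysite", lookup))
print(is_internal("public.example", lookup))
```

Treating unresolved hosts as not internal is a design choice in this sketch; an automation might prefer to flag them separately.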

livehybrid · ‎07-06-2025

My apologies @dtaylor

Not had my morning coffee yet.. how about this?

(?:http|ftp|hxxp)s?:\/\/([\p{L}\d-]+(?:\.[\p{L}\d-]+)*)
