Splunk Search

Convert Regex to Detect Internal Domains

dtaylor
Path Finder

This may not be the best place to ask given my issue isn't technically Splunk related, but hopefully I can get some help from people smarter than me anyway.

 

(?i)(?P<scheme>(?:http|ftp|hxxp)s?(?:://|-3A__|%3A%2F%2F))?(?:%[\da-f][\da-f])?(?P<domain>(?:[\p{L}\d\-–]+(?:\.|\[\.\]))+[\p{L}]{2,})(@|%40)?(?:\b| |[[:punct:]]|$)

The regex above is a template I'm working from (lol, I'm not nearly good enough to write this). It isn't too hard to read: in a nutshell, it matches the domain of a URL and nothing else. It first looks for an optional scheme such as 'https://' and stores it in the 'scheme' group, then parses the domain that follows. For example, the URL 'https://community.splunk.com/t5/forums/postpage/board-id/splunk-search' would match 'community.splunk.com'.

 

My issue is that the way it looks for the domain after the 'scheme' group requires a TLD (.com, .net, etc.). Unfortunately, the internal services used by my company don't use a TLD, so the regex doesn't catch them. I need to change it so that it does.
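To make that behaviour concrete, here is a minimal Python sketch of the template. It is an approximation rather than the exact expression: Python's standard `re` module supports neither `\p{L}` nor `[[:punct:]]`, so the sketch assumes `[^\W\d_]` as a stand-in for letters and an ASCII-only punctuation class. The names `TEMPLATE` and `extract_domain` are hypothetical.

```python
import re

# Assumed stand-ins, because Python's re has no \p{L} or [[:punct:]].
LETTER = r"[^\W\d_]"            # approximately "any Unicode letter"
LABEL = r"(?:[^\W_]|[-–])+"     # letters, digits, hyphen, en dash
PUNCT = r"[!-/:-@\[-`{-~]"      # ASCII punctuation

TEMPLATE = re.compile(
    r"(?P<scheme>(?:http|ftp|hxxp)s?(?:://|-3A__|%3A%2F%2F))?"
    r"(?:%[\da-f][\da-f])?"
    # One or more dot-terminated labels, then a letters-only TLD.
    r"(?P<domain>(?:" + LABEL + r"(?:\.|\[\.\]))+" + LETTER + r"{2,})"
    r"(@|%40)?"
    r"(?:\b| |" + PUNCT + r"|$)",
    re.IGNORECASE,
)


def extract_domain(url):
    """Return the 'domain' group of the first match, or None."""
    match = TEMPLATE.search(url)
    return match.group("domain") if match else None


print(extract_domain("https://community.splunk.com/t5/forums/postpage/board-id/splunk-search"))
# community.splunk.com
print(extract_domain("https://mysite/resources/rules/123456"))
# None: no label is followed by a dot, so the 'domain' group cannot match
```

The second call shows the limitation: a single-label host such as 'mysite' never satisfies the "label, dot, TLD" shape that the 'domain' group demands.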

 

I want to modify the regex above to match URLs like 'https://mysite/resources/rules/123456', where the domain would be 'mysite'. I've attempted to do so, but with my limited understanding of how regex really works, my attempts lead to too many matches, as shown below.

(?i)(?P<scheme>(?:http|ftp|hxxp)s?(?::\/\/|-3A__|%3A%2F%2F))?(?:%[\da-f][\da-f])?(?P<domain>((?:[\p{L}\d\-–]+(?:\.|\[\.\]))+)?[\p{L}]{2,})(@|%40)?(?:\b| |[[:punct:]]|$)

I tried adding an extra group inside the named 'domain' group to make the entire first half of the 'domain' group optional, but it leads to matches beyond the domain.
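The extra matches can be reproduced with a small Python sketch of the modified expression. It uses the same assumed stand-ins for `\p{L}` and `[[:punct:]]`, which the standard `re` module lacks, and the names `ATTEMPT`, `all_domains`, and `first_domain_with_scheme` are hypothetical. Because both the 'scheme' group and the dotted prefix are now optional, any run of two or more letters satisfies the 'domain' group, so each alphabetic path segment matches as well.

```python
import re

# Assumed stand-ins, because Python's re has no \p{L} or [[:punct:]].
LETTER = r"[^\W\d_]"
LABEL = r"(?:[^\W_]|[-–])+"
PUNCT = r"[!-/:-@\[-`{-~]"

ATTEMPT = re.compile(
    r"(?P<scheme>(?:http|ftp|hxxp)s?(?:://|-3A__|%3A%2F%2F))?"
    r"(?:%[\da-f][\da-f])?"
    # The dotted prefix is optional here, so bare words qualify too.
    r"(?P<domain>(?:(?:" + LABEL + r"(?:\.|\[\.\]))+)?" + LETTER + r"{2,})"
    r"(@|%40)?"
    r"(?:\b| |" + PUNCT + r"|$)",
    re.IGNORECASE,
)


def all_domains(text):
    """Every value captured by the 'domain' group, in order."""
    return [m.group("domain") for m in ATTEMPT.finditer(text)]


def first_domain_with_scheme(text):
    """Keep only a match in which the 'scheme' group took part."""
    for m in ATTEMPT.finditer(text):
        if m.group("scheme"):
            return m.group("domain")
    return None


print(all_domains("https://mysite/resources/rules/123456"))
# ['mysite', 'resources', 'rules']
print(first_domain_with_scheme("https://mysite/resources/rules/123456"))
# mysite
```

One possible way out, sketched in `first_domain_with_scheme`, is to accept only matches where the scheme was actually present; making the scheme mandatory in the pattern itself would have a similar effect.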

Thank you to whoever may be able to assist. This doesn't feel like it should be such a difficult thing, but it's been vexing me for hours.


yuanliu
SplunkTrust

Is there a specific reason why you must use your own regex to extract the domain? There are much more mature and robust options, including Splunk's built-in transforms, e.g., url:

| extract url
| rex field=domain "(?<domain>.+)(?::(?<port>\d+))$"
| rename proto as scheme

(The url transform results in a field named "domain" that contains both domain and port. This is why I add a second extraction to separate port from domain. It also gives a field "proto" which you call scheme.)

Here is some mock data for you to play with and compare with real data:

| makeresults format=csv data="_raw
http://www.google.com/search?q=what%20about%20bob
https://yahoo.com:443/
ftp://localhost:23/
ssh://1234:abcd:::21/"
``` data emulation above ```

They should give

| _raw | domain | port | q | scheme | uri | url |
|---|---|---|---|---|---|---|
| http://www.google.com/search?q=what%20about%20bob | www.google.com | | what%20about%20bob | http | /search?q=what%20about%20bob | http://www.google.com/search?q=what%20about%20bob |
| https://yahoo.com:443/ | yahoo.com | 443 | | https | / | https://yahoo.com:443/ |
| ftp://localhost:23/ | localhost | 23 | | ftp | / | ftp://localhost:23/ |
| ssh://1234:abcd:::21/ | 1234:abcd:: | 21 | | ssh | / | ssh://1234:abcd:::21/ |

dtaylor
Path Finder

Without going into detail, Splunk isn't what performs the domain extraction, so I have to rely on a regex.


yuanliu
SplunkTrust

Understood. When I say to use Splunk's mature/robust solution, I don't mean it has to happen inside Splunk. All you need is to use the regex that Splunk has QA tested for you. The regex in Splunk's url transform is this:

(?<url>[[alphas:proto]]://(?<domain>[a-zA-Z0-9\-.:]++)(?<uri>/[^\s"]*)?)

Here is the same test, except that I replace the transform with the above regex.

| makeresults format=csv data="_raw
http://www.google.com/search?q=what%20about%20bob
https://yahoo.com:443/
ftp://localhost:23/
ssh://1234:abcd:::21/"

| rex "(?<url>[[alphas:proto]]://(?<domain>[a-zA-Z0-9\-.:]++)(?<uri>/[^\s\"]*)?)"
| rex field=domain "(?<domain>.+)(?::(?<port>\d+))$"
| rename proto as scheme

(Because the rex command requires the regex to be in double quotes, I have to escape the double quote inside the uri group.) It gives the same valid results that you want:

| _raw | domain | port | scheme | uri | url |
|---|---|---|---|---|---|
| http://www.google.com/search?q=what%20about%20bob | www.google.com | | http | /search?q=what%20about%20bob | http://www.google.com/search?q=what%20about%20bob |
| https://yahoo.com:443/ | yahoo.com | 443 | https | / | https://yahoo.com:443/ |
| ftp://localhost:23/ | localhost | 23 | ftp | / | ftp://localhost:23/ |
| ssh://1234:abcd:::21/ | 1234:abcd:: | 21 | ssh | / | ssh://1234:abcd:::21/ |
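Since the extraction happens outside Splunk, the same two-step approach can be sketched in Python. This is an illustration under assumptions, not Splunk's implementation: it assumes `[[alphas:proto]]` expands to a named group of ASCII letters, it writes the possessive `++` as a plain `+` so the pattern also compiles on Python versions before 3.11, and the names `URL_RE`, `PORT_RE`, and `parse_url` are hypothetical.

```python
import re

# Assumption: [[alphas:proto]] is equivalent to (?P<proto>[a-zA-Z]+).
URL_RE = re.compile(
    r'(?P<url>(?P<proto>[a-zA-Z]+)://'
    r'(?P<domain>[a-zA-Z0-9\-.:]+)'
    r'(?P<uri>/[^\s"]*)?)'
)
# Second step: split a trailing ":<digits>" off the domain as the port.
PORT_RE = re.compile(r"(?P<domain>.+):(?P<port>\d+)$")


def parse_url(text):
    """Return scheme, domain, port, uri and url of the first URL, or None."""
    match = URL_RE.search(text)
    if match is None:
        return None
    domain, port = match.group("domain"), None
    split = PORT_RE.search(domain)
    if split:
        domain, port = split.group("domain"), split.group("port")
    return {
        "scheme": match.group("proto"),
        "domain": domain,
        "port": port,
        "uri": match.group("uri"),
        "url": match.group("url"),
    }


print(parse_url("https://yahoo.com:443/")["domain"])
# yahoo.com
print(parse_url("https://mysite/resources/rules/123456")["domain"])
# mysite
```

Because the domain character class does not require a dot, a single-label host such as 'mysite' is captured the same way as a fully qualified name.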

Prewin27
Contributor

@dtaylor 

You can use the one below:

(?i)(?:http|ftp|hxxp)s?:\/\/(?:www\.)?(?P<domain>[a-zA-Z0-9\-\.]+)


Regards,
Prewin

livehybrid
Super Champion

Hi @dtaylor 

You could try with the following:

(?i)(?P<scheme>(?:http|ftp|hxxp)s?(?::\/\/|-3A__|%3A%2F%2F))?(?:%[\da-f][\da-f])?(?P<domain>[\p{L}\d\–]+(?:\.[\p{L}\d\–]+)*)(@|%40)?(?:\b| |[[:punct:]]|$)

[Screenshot: livehybrid_0-1751782045663.png]


dtaylor
Path Finder

Thank you for offering to help me. I tested your example regex and, as in your screenshot, the domain group matches a lot more than just the domain. I need the domain group to match only the domain of a URL. I apologize if this wasn't clear.

The end goal is to use the expression in an automation that takes in URLs, parses the domain, performs a DNS lookup on each domain, and judges whether the domain is hosted locally based on the IP.
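A rough Python sketch of the lookup-and-judge step, using only the standard library. It assumes that "hosted locally" means every resolved address lies in private or loopback space, as reported by `is_private` in the `ipaddress` module; the real criterion may differ. The function names and the injectable `resolver` parameter are hypothetical.

```python
import ipaddress
import socket


def resolve(host):
    """Resolve a host name to a set of IP address strings via the system resolver."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def is_internal(host, resolver=resolve):
    """True if the host resolves and every address is private (assumed criterion)."""
    try:
        addresses = resolver(host)
    except OSError:  # socket.gaierror (lookup failure) is a subclass of OSError
        return False
    if not addresses:
        return False
    # Drop any IPv6 scope suffix such as "%eth0" before parsing.
    return all(
        ipaddress.ip_address(addr.split("%")[0]).is_private
        for addr in addresses
    )
```

Passing a stub as `resolver` (for example `lambda host: {"10.1.2.3"}`) lets the classification be exercised without real DNS traffic; calling `is_internal("mysite")` with the default resolver performs an actual lookup.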


livehybrid
Super Champion

My apologies @dtaylor 

Not had my morning coffee yet.. how about this?

(?:http|ftp|hxxp)s?:\/\/([\p{L}\d-]+(?:\.[\p{L}\d-]+)*)

[Screenshot: livehybrid_0-1751785660716.png]
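For anyone testing this expression outside Splunk, here is a minimal Python sketch. The standard `re` module has no `\p{L}`, so it assumes `[^\W_]` (letters and digits) plus the hyphen as a stand-in for `[\p{L}\d-]`; `DOMAIN_RE` and `domains` are hypothetical names.

```python
import re

LABEL = r"(?:[^\W_]|-)+"  # assumed stand-in for [\p{L}\d-]+

# The scheme is mandatory here, so path segments can no longer match.
DOMAIN_RE = re.compile(
    r"(?:http|ftp|hxxp)s?://(" + LABEL + r"(?:\." + LABEL + r")*)"
)


def domains(text):
    """Return the host part of every URL found in the text."""
    return [m.group(1) for m in DOMAIN_RE.finditer(text)]


print(domains("https://mysite/resources/rules/123456"))
# ['mysite']
print(domains("https://community.splunk.com/t5/forums/postpage/board-id/splunk-search"))
# ['community.splunk.com']
```

Compared with the original template, this version drops the encoded scheme separators (`-3A__`, `%3A%2F%2F`) and the defanged `[.]` form, so those would have to be added back if the input can contain them.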
