Splunk Search

Convert Regex to Detect Internal Domains

dtaylor
Path Finder

This may not be the best place to ask given my issue isn't technically Splunk related, but hopefully I can get some help from people smarter than me anyway.

 

(?i)(?P<scheme>(?:http|ftp|hxxp)s?(?:://|-3A__|%3A%2F%2F))?(?:%[\da-f][\da-f])?(?P<domain>(?:[\p{L}\d\-–]+(?:\.|\[\.\]))+[\p{L}]{2,})(@|%40)?(?:\b| |[[:punct:]]|$)

The above regex is a template I'm working from(lol, I'm not nearly good enough to write this). While it's not too hard to read and see how it works, in a nut shell, it matches on the domain of a URL and nothing else. It does this by first looking for the optional beginning 'https://' and storing that in the 'scheme' group. Following that, it parses the following domain.  For example, the URL 'https://community.splunk.com/t5/forums/postpage/board-id/splunk-search' would match 'community.splunk.com'

 

My issue is that the way it looks for domains following the 'scheme' group requires it use a TLD(.com, .net, etc). Unfortunately, internal services used by my company don't use a TLD, and this causes the regex not to catch them. I need to change this so it can do this.

 

I want to modify the regex expression above to detect on URLs like: 'https://mysite/resources/rules/123456' wherein the domain would be 'mysite'. I've attempted to do so, but with my limited understanding of how regex really works, my attempts lead to too many matches as shown below.

(?i)(?P<scheme>(?:http|ftp|hxxp)s?(?::\/\/|-3A__|%3A%2F%2F))?(?:%[\da-f][\da-f])?(?P<domain>((?:[\p{L}\d\-–]+(?:\.|\[\.\]))+)?[\p{L}]{2,})(@|%40)?(?:\b| |[[:punct:]]|$)

I tried to throw in an extra non-capturing group within the named 'domain' ground and make the entire first half of the 'domain' group optional, but it leads to matches beyond the domain.

Thank you to whomever may be able to assist. This doesn't feel like it should be such a difficult thing, but it's been vexing me for hours.

Labels (2)
Tags (1)
0 Karma

yuanliu
SplunkTrust
SplunkTrust

Is there any specific reason why you must use your own regex to extract domain?  There are much more mature/robust algorithms, including Splunk's built-in transforms, e.g., url:

| extract url
| rex field=domain "(?<domain>.+)(?::(?<port>\d+))$"
| rename proto as scheme

(The url transform results in a field named "domain" that contains both domain and port. This is why I add a second extraction to separate port from domain. It also gives a field "proto" which you call scheme.)

Here are some mock data for you to play with and compare with real data

| makeresults format=csv data="_raw
http://www.google.com/search?q=what%20about%20bob
https://yahoo.com:443/
ftp://localhost:23/
ssh://1234:abcd:::21/"
``` data emulation above ```

They should give

_rawdomainportqschemeuriurl
http://www.google.com/search?q=what%20about%20bobwww.google.com what%20about%20bobhttp/search?q=what%20about%20bobhttp://www.google.com/search?q=what%20about%20bob
https://yahoo.com:443/yahoo.com443 https/https://yahoo.com:443/
ftp://localhost:23/localhost23 ftp/ftp://localhost:23/
ssh://1234:abcd:::21/1234:abcd::21 ssh/ssh://1234:abcd:::21/
Tags (1)
0 Karma

Prewin27
Contributor

@dtaylor 

You can use below one,

(?i)(?:https?|ftp|hxxp)s?:\/\/(?:www\.)?(?P<domain>[a-zA-Z0-9\-\.]+)


Regards,
Prewin
Splunk Enthusiast | Always happy to help! If this answer helped you, please consider marking it as the solution or giving a Karma. Thanks!

0 Karma

livehybrid
Super Champion

Hi @dtaylor 

You could try with the following:

(?i)(?P<scheme>(?:http|ftp|hxxp)s?(?::\/\/|-3A__|%3A%2F%2F))?(?:%[\da-f][\da-f])?(?P<domain>[\p{L}\d\–]+(?:\.[\p{L}\d\–]+)*)(@|%40)?(?:\b| |[[:punct:]]|$)

livehybrid_0-1751782045663.png

🌟 Did this answer help you? If so, please consider:

  • Adding karma to show it was useful
  • Marking it as the solution if it resolved your issue
  • Commenting if you need any clarification

Your feedback encourages the volunteers in this community to continue contributing

 

0 Karma

dtaylor
Path Finder

Thank you for offering to help me. I tested your example Regex, and like with your screenshot, it looks like I'm getting a lot more matches for the domain group than just the domain. I need the domain group to only match the domain of a URL. I apologize of this wasn't clear.

The end goal of this is to use the expression in an automation which take in URL's, parse the domain, perform a DNS lookup on the domains, and judge whether a domain is hosted locally based on the IP.

0 Karma

livehybrid
Super Champion

My apologies @dtaylor 

Not had my morning coffee yet.. how about this?

(?:http|ftp|hxxp)s?:\/\/([\p{L}\d-]+(?:\.[\p{L}\d-]+)*)

livehybrid_0-1751785660716.png

🌟 Did this answer help you? If so, please consider:

  • Adding karma to show it was useful
  • Marking it as the solution if it resolved your issue
  • Commenting if you need any clarification

Your feedback encourages the volunteers in this community to continue contributing

 

0 Karma
Get Updates on the Splunk Community!

Stay Connected: Your Guide to July Tech Talks, Office Hours, and Webinars!

What are Community Office Hours?Community Office Hours is an interactive 60-minute Zoom series where ...

Updated Data Type Articles, Anniversary Celebrations, and More on Splunk Lantern

Splunk Lantern is a Splunk customer success center that provides advice from Splunk experts on valuable data ...

A Prelude to .conf25: Your Guide to Splunk University

Heading to Boston this September for .conf25? Get a jumpstart by arriving a few days early for Splunk ...