Field extraction: domain name OR IP address from URL

temperuser
Explorer

I have Squid logs in which the URL contains either a domain name or an IP address in place of the domain:

google.com/search

217.212.123.211:443

I am trying to extract the field with this regex:

(?:[^#\n]*#){4}(?:http\:\/\/)*(?:[^\/\.]+\.)*(?P<domain_or_ip>(?:(?:\d{1,3}\.){3}\d{1,3})|(?:[^\/\.]+\.[^\/\.]+))(?:\/|\:).*

But it extracts only two octets of an IP address (domains are extracted correctly). If I remove the OR part from the regex, it works perfectly for IP addresses alone or for domain names alone, e.g. for IP addresses:

(?:[^#\n]*#){4}(?:http\:\/\/)*(?:[^\/\.]+\.)*(?P<ip>(?:\d{1,3}\.){3}\d{1,3})(?:\/|\:).*

So my question is: how should I implement this extraction?
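The behavior described in the question can be reproduced outside Splunk. The following is a minimal sketch in Python's `re` module, which accepts the same `(?P<name>...)` syntax. The sample lines and their four placeholder fields (`f1` to `f4`, `rest`) are hypothetical, since the real log lines are not shown in the thread.

```python
import re

# The regex from the question, unchanged (greedy subdomain prefix).
GREEDY = re.compile(
    r"(?:[^#\n]*#){4}(?:http\:\/\/)*(?:[^\/\.]+\.)*"
    r"(?P<domain_or_ip>(?:(?:\d{1,3}\.){3}\d{1,3})|(?:[^\/\.]+\.[^\/\.]+))"
    r"(?:\/|\:).*"
)

# Hypothetical #-delimited events; only the fifth field mirrors the question.
LINES = [
    "f1#f2#f3#f4#google.com/search#rest",
    "f1#f2#f3#f4#217.212.123.211:443#rest",
]


def extract(pattern, line):
    """Return the domain_or_ip capture, or None if the line does not match."""
    match = pattern.search(line)
    return match.group("domain_or_ip") if match else None


results = [extract(GREEDY, line) for line in LINES]
print(results)  # ['google.com', '123.211']
```

On these sample lines the greedy `(?:[^\/\.]+\.)*` first consumes the leading octets as if they were subdomain labels. The engine then backtracks only as far as it must, which is the point where the domain alternative can match the remaining `123.211`. The IP alternative never gets a chance to see all four octets.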

1 Solution

temperuser
Explorer

Many thanks to @rsennett_splunk for recommending regex101.com. I fixed my regex with it, and now it captures all domains and IP addresses in the logs:

 ([^#\n]*#){4}(http\:\/\/)*([^\/\.]+\.)*?(?P<domain_or_ip>((\d{1,3}\.){3}\d{1,3})|([^\/\.]+\.[^\/\.]+))(\/|\:).*

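It needed one small fix: a lazy quantifier instead of a greedy one in the third capture group: ([^\/\.]+\.)*?. With the greedy quantifier, the optional subdomain prefix swallowed the leading octets of an IP address, so only the domain alternative could match what was left.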
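As a quick check of the lazy-quantifier fix, here is a small Python sketch that runs the accepted regex against a few lines. The lines themselves are hypothetical stand-ins for the #-delimited events (the `f1` to `f4` fields and `rest` are placeholders, not real Squid fields).

```python
import re

# The accepted regex from this answer: note the lazy *? on the third group.
FIXED = re.compile(
    r"([^#\n]*#){4}(http\:\/\/)*([^\/\.]+\.)*?"
    r"(?P<domain_or_ip>((\d{1,3}\.){3}\d{1,3})|([^\/\.]+\.[^\/\.]+))"
    r"(\/|\:).*"
)

# Hypothetical events mapped to the value we expect to be extracted.
SAMPLES = {
    "f1#f2#f3#f4#google.com/search#rest": "google.com",
    "f1#f2#f3#f4#http://www.google.com/search#rest": "google.com",
    "f1#f2#f3#f4#217.212.123.211:443#rest": "217.212.123.211",
}


def extract_host(line):
    """Return the domain_or_ip capture, or None if the line does not match."""
    match = FIXED.search(line)
    return match.group("domain_or_ip") if match else None


for line, expected in SAMPLES.items():
    print(line, "->", extract_host(line), "(expected", expected + ")")
```

Because the prefix is lazy, the engine tries the alternation with zero subdomain labels first, so the IP alternative sees all four octets. The prefix only grows (for example to `www.`) when neither alternative can match at the current position.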

rsennett_splunk
Splunk Employee

Your example suggests that you always have only the domain name (rather than a hostname as well), so I would just anchor on the http://

There is no need for all the non-capturing groups, since you are only going to capture what's in the capturing group:


http:\/\/(?<domain_or_ip>((\d{1,3}\.){3}\d{1,3})|([^\/]+))


Anything NOT inside your field capturing group (?<myfield>myregex) is already excluded from the extracted value.

If you do sometimes have a prefix on the domain, e.g. www.google.com/search, then that'll be a bit more complex.
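To see what the anchored pattern does and does not cover, here is a hedged Python sketch. Two adjustments are assumptions on my part: Python's `re` needs `(?P<name>...)` rather than the PCRE-style `(?<name>...)`, and the dot in the IP branch is escaped so it matches a literal dot. The sample URLs are made up for illustration.

```python
import re

# Python port of the pattern anchored on "http://".
ANCHORED = re.compile(
    r"http://(?P<domain_or_ip>((\d{1,3}\.){3}\d{1,3})|([^/]+))"
)


def host(url):
    """Return the domain_or_ip capture, or None when there is no http:// anchor."""
    match = ANCHORED.search(url)
    return match.group("domain_or_ip") if match else None


print(host("http://google.com/search"))           # google.com
print(host("http://217.212.123.211/index.html"))  # 217.212.123.211
print(host("http://www.google.com/search"))       # www.google.com
print(host("217.212.123.211:443"))                # None
```

The last two calls show the limits mentioned above: a `www.` prefix is kept as part of the value, and entries without `http://` (such as the `:443` form) are not matched at all.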

If I've misunderstood, you might want to post a few full log lines. It looks like you're trying to account for a header prefix with the # parts, and I've not seen that in Squid logs, but that could just be me. I'm also not sure whether, when you have an IP, it is simply replacing the domain (not resolved) or whether you mean it is not really a URI at all. A few example lines would help, especially since your regex is trying to account for the rest of the event, which we can't really see.

With Splunk... the answer is always "YES!". It just might require more regex than you're prepared for!

temperuser
Explorer

Thanks for the nice tool advice, I'll try it in a few hours. Here are some messages that I found in the logs: http://pastebin.com/9vj9FtMM - I hope it helps to understand my regex.

rsennett_splunk
Splunk Employee

The reason I was hoping to see what the events looked like was to see whether there was an anchor to be had. You've used the fields preceding the one you want to extract, but you don't anchor it on the other side. Using the "dot-star" is very greedy, and with all the "sometimes" options and no indication of where to end, your regex is canceling itself out in spots. You might want to use a tool like regex101.com. I like that one because it works with named capture groups and will step you through all the instructions you've given. See if you can anchor the end as well.
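On the question of where the match ends, here is a small Python sketch using the lazy-prefix form of the regex from this thread, with the groups made non-capturing for brevity. It compares the pattern with and without the trailing `.*`. The sample lines are hypothetical, and this checks behavior in Python's `re.search` only, not in Splunk itself.

```python
import re

# Shared pieces: four #-delimited fields, optional http://, lazy subdomain prefix.
HEAD = r"(?:[^#\n]*#){4}(?:http://)*(?:[^/.]+\.)*?"
HOST = r"(?P<domain_or_ip>(?:\d{1,3}\.){3}\d{1,3}|[^/.]+\.[^/.]+)"

WITH_TAIL = re.compile(HEAD + HOST + r"[/:].*")  # ends with dot-star
NO_TAIL = re.compile(HEAD + HOST + r"[/:]")      # stops at the delimiter

# Hypothetical events; f1..f4 and "rest" are placeholders.
LINES = [
    "f1#f2#f3#f4#google.com/search#rest",
    "f1#f2#f3#f4#217.212.123.211:443#rest",
    "f1#f2#f3#f4#http://www.google.com/search#rest",
]


def extract(pattern, line):
    """Return the domain_or_ip capture, or None if the line does not match."""
    match = pattern.search(line)
    return match.group("domain_or_ip") if match else None


pairs = [(extract(WITH_TAIL, line), extract(NO_TAIL, line)) for line in LINES]
print(pairs)
```

In this sketch both patterns capture the same value for every line. What ends the field is the `/` or `:` delimiter right after the capture group, while the trailing `.*` only consumes the rest of the event.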

temperuser
Explorer

Sorry for the unclear question. I can't post full URLs due to low karma, so it is not trivial for me to post full log messages. There are several variants in the logs: with or without the http:// prefix, with or without subdomains, with IP addresses instead of domains, and in the format <domain_or_ip>:443 for https sessions. And yes, the first part of the regex with # handles the leading fields of the #-delimited format, and it works correctly. As far as I can tell, my full regex (with the OR parts) behaves identically to the regex with only the domain-extraction part in place of the OR part, and I can't explain why that happens.
