I have Squid logs in which the URL field contains either a domain name or an IP address:
google.com/search
217.212.123.211:443
I am trying to extract that field with this regex:
(?:[^#\n]*#){4}(?:http\:\/\/)*(?:[^\/\.]+\.)*(?P<domain_or_ip>(?:(?:\d{1,3}\.){3}\d{1,3})|(?:[^\/\.]+\.[^\/\.]+))(?:\/|\:).*
But for IP addresses it captures only two octets (domains are extracted correctly). If I remove one branch of the alternation, the regex works perfectly for IP addresses alone or for domain names alone, e.g. for IP addresses:
(?:[^#\n]*#){4}(?:http\:\/\/)*(?:[^\/\.]+\.)*(?P<ip>(?:\d{1,3}\.){3}\d{1,3})(?:\/|\:).*
So my question is: how should I write a single extraction that handles both cases?
Many thanks to @rsennet_splunk for recommending regex101.com. I fixed my regex with it, and it now captures all domains and IP addresses in the logs:
([^#\n]*#){4}(http\:\/\/)*([^\/\.]+\.)*?(?P<domain_or_ip>((\d{1,3}\.){3}\d{1,3})|([^\/\.]+\.[^\/\.]+))(\/|\:).*
The only change is a lazy quantifier instead of a greedy one in the third capture group: ([^\/\.]+\.)*?
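To see why the lazy quantifier matters, here is a minimal Python sketch that runs the greedy and the lazy variants side by side. The sample lines are hypothetical (the four leading fields f1 to f4 are invented placeholders; only the last field mirrors the examples in the question), and Python's re module stands in for Splunk's PCRE engine. With the greedy prefix group, the leading octets are consumed as if they were subdomain labels, and after backtracking the domain branch matches what is left.

```python
import re

# Hypothetical '#'-delimited lines; only the fifth field mirrors the question.
SAMPLES = [
    "f1#f2#f3#f4#google.com/search",
    "f1#f2#f3#f4#http://www.google.com/search",
    "f1#f2#f3#f4#217.212.123.211:443",
]

# Original pattern: the subdomain-skipping group is greedy.
GREEDY = re.compile(
    r"(?:[^#\n]*#){4}(?:http\:\/\/)*(?:[^\/\.]+\.)*"
    r"(?P<domain_or_ip>(?:(?:\d{1,3}\.){3}\d{1,3})|(?:[^\/\.]+\.[^\/\.]+))(?:\/|\:).*"
)

# Fixed pattern: same group, but lazy (*?), so the alternation is tried
# before any label is skipped.
LAZY = re.compile(
    r"(?:[^#\n]*#){4}(?:http\:\/\/)*(?:[^\/\.]+\.)*?"
    r"(?P<domain_or_ip>(?:(?:\d{1,3}\.){3}\d{1,3})|(?:[^\/\.]+\.[^\/\.]+))(?:\/|\:).*"
)


def extract(pattern, line):
    """Return the captured domain or IP, or None if the line does not match."""
    m = pattern.match(line)
    return m.group("domain_or_ip") if m else None


for line in SAMPLES:
    print(f"{line!r}: greedy={extract(GREEDY, line)!r} lazy={extract(LAZY, line)!r}")
```

On the IP sample the greedy variant yields 123.211, while the lazy variant yields the full 217.212.123.211; both variants return google.com for the two domain samples.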
Your example suggests that you always have only the domain name (rather than a hostname as well), so I would just anchor on the http://
There is no need for all the non-capturing groups, since you only keep what is in the named capturing group:
http:\/\/(?<domain_or_ip>((\d{1,3}\.){3}\d{1,3})|([^\/]+))
Anything NOT inside your field capturing group (?<myfield>myregex) is ignored already.
If you do sometimes have a prefix to the domain, i.e. www.google.com/search, then that'll be a bit more complex.
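As a rough illustration of the simpler pattern, here is a Python sketch. Python's re module needs the (?P<name>...) spelling for named groups, whereas PCRE also accepts (?<name>...); the URLs are made-up test inputs, not lines from the actual logs.

```python
import re

# Anchor on the scheme and capture either a dotted quad or everything up to
# the next '/'. Named group written in Python's (?P<name>...) syntax.
PATTERN = re.compile(r"http://(?P<domain_or_ip>(?:\d{1,3}\.){3}\d{1,3}|[^/]+)")


def extract(text):
    """Return the host part after 'http://', or None when there is no such prefix."""
    m = PATTERN.search(text)
    return m.group("domain_or_ip") if m else None


print(extract("http://google.com/search"))           # bare domain
print(extract("http://217.212.123.211/index.html"))  # IP address
print(extract("http://www.google.com/search"))       # keeps the www. prefix
print(extract("217.212.123.211:443"))                # no http:// to anchor on
```

This shows both limits mentioned above: a www. prefix stays in the captured value, and entries without http:// are not matched at all.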
If I've misunderstood, you might want to post a few full log lines. It looks like you're trying to account for a header prefix with the # stuff, and I've not seen that in Squid logs, but that could just be me. I'm also not sure whether an IP simply replaces the domain (because it was not resolved) or whether the field is not really a URI in that case. A few example lines would help, especially since your regex tries to account for the rest of the event, which we can't see.
Thanks for the nice tool advice, I'll try it in a few hours. Here are some messages that I found in the logs (I hope they help to explain my regex): http://pastebin.com/9vj9FtMM
The reason I was hoping to see what the events looked like was to see if there was an anchor to be had. You've used the fields preceding the one you want to extract, but you don't anchor it on the other side. Using the "dot-star" is very greedy, and with all the optional parts and no clear idea of where to end, your regex is canceling itself out in spots. You might want to use a tool like regex101.com. I like that one because it works with named capture groups and will step you through all the instructions you've given. See if you can anchor the end as well.
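One possible way to anchor the end, sketched in Python under assumptions: the field of interest is the fifth '#'-delimited field, and the host always ends at a '/', ':', '#', or the end of the line. The sample lines are invented, and the lookahead is just one option, not necessarily what the original poster ended up using.

```python
import re

BOUNDED = re.compile(
    r"^(?:[^#\n]*#){4}"      # skip the first four '#'-delimited fields
    r"(?:http://)?"          # optional scheme
    r"(?:[^/.:#]+\.)*?"      # lazily skip subdomain labels
    r"(?P<domain_or_ip>(?:\d{1,3}\.){3}\d{1,3}|[^/.:#]+\.[^/.:#]+)"
    r"(?=[/:#]|$)"           # the capture must end at '/', ':', '#', or end of line
)


def extract(line):
    """Return the domain or IP from the fifth field, or None if nothing matches."""
    m = BOUNDED.match(line)
    return m.group("domain_or_ip") if m else None


# Hypothetical log lines for illustration only.
for line in (
    "f1#f2#f3#f4#www.google.com/search#200",
    "f1#f2#f3#f4#217.212.123.211:443#200",
    "f1#f2#f3#f4#http://google.com",
):
    print(line, "->", extract(line))
```

Because the lookahead states where the value has to stop, the trailing .* is no longer needed, and the character classes cannot run past the ':' or into the next field.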
Sorry for the unclear question. I can't post full URLs due to low karma, so it is not easy for me to post full log messages. There are several variants in the logs: with or without the http:// prefix, with or without subdomains, with IP addresses instead of domains, and in the format <domain_or_ip>:443 for https sessions. And yes, the first part of the regex with # handles the first fields of the #-delimited format, and it works correctly. As far as I can tell, my full regex (with the alternation) behaves identically to the regex with only the domain branch, and I can't explain why that happens.