Hi there,
I have taken the following regex from here...
And modified it to suit domains such as .com.au, leaving it like:
(?<domainname>(?<ip>^[A-Fa-f\d\.:]+$)|(?<nodots>^[^\.]+$)|(?<fqdomain>(?:(?:[^\.]+\.)?(?<tld>((?:[^\.\s]{3})|(?:[^\.\s]{2}))(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+)))))$)
Now, I have data formatted in csv style containing a url string...
To extract the domain/ip string from the data, I use this regex:
(?i)^(?:[^ ]* ){12}.+://(?P<domain>[^:|,|/]+)[/,]?
What I wish to do is create a single regex that produces the domainname, nodots, fqdomain and tld fields from the value captured as domain by the second regex.
Can someone please help me combine the two extractions into a single regex?
I'm not the best when it comes to Splunk regex...
Here is some sample data:
Aug 28 13:05:26 111.111.1.1 28-08-2012; 13:04:48, 26, 111.111.111.11, username@hostname, 125679, 1, text/html, http://global.ebsco-content.com/interfacefiles/12.4.33.0.2/javascript/bundled/_layout2/master.js, default, Educational
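A quick way to sanity-check the domain extraction outside Splunk is to run it against the sample line in Python. This is only a sketch: the regex is the one quoted above (it already uses the `(?P<name>...)` group syntax that Python accepts), while the names `SAMPLE`, `EXTRACT` and `extract_domain` are made up for the illustration.

```python
import re
from typing import Optional

# Sample event copied from the post above.
SAMPLE = (
    "Aug 28 13:05:26 111.111.1.1 28-08-2012; 13:04:48, 26, 111.111.111.11, "
    "username@hostname, 125679, 1, text/html, "
    "http://global.ebsco-content.com/interfacefiles/12.4.33.0.2/javascript/"
    "bundled/_layout2/master.js, default, Educational"
)

# Extraction regex from the post, unchanged: skip 12 space-delimited
# tokens, skip the protocol, then capture up to the first ':', '|', ',' or '/'.
EXTRACT = re.compile(r"(?i)^(?:[^ ]* ){12}.+://(?P<domain>[^:|,|/]+)[/,]?")


def extract_domain(event: str) -> Optional[str]:
    """Return the captured domain, or None if the event does not match."""
    m = EXTRACT.search(event)
    return m.group("domain") if m else None


print(extract_domain(SAMPLE))  # global.ebsco-content.com
```

Note that the character class `[^:|,|/]` also excludes a literal `|`, since alternation has no meaning inside brackets; `[^:,/]` would probably express the intent more directly.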
Okay so...
Didn't have much luck with the previous response sorry...
I have been fiddling to see what I can come up with, and this is what I now have:
([^,]+, ){7}[^/]+://(?<basedomain>(\[(?<ip6>[^\]]+)\][:/, ])|((?<ip4>\d+(\.\d+){3})[:/, ])|((?<nodots>[^\.,/: ]+)[:,/ ])|(?<fqdomain>(?:(?:[^\.]+\.)?(?<tld>((?:[^\.\s]{3})|(?:[^\.\s]{2}))(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+))))))
I understand basedomain is broken, but so are fqdomain and tld, which for the life of me I can't get to work. (ip4 and ip6, as well as nodots, work nicely though.)
From the sample data in my first post, this is what I expect to see when the following search is run:
base-search |table basedomain ip4 fqdomain tld
ebsco-content.com <blank> global.ebsco-content.com .com
I need some regex experts to help me on this...
Thanks in advance, Aaron.
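One possible direction, sketched in Python rather than Splunk so it can be tested in isolation: keep the comma-field prefix from the attempt above, then nest basedomain inside fqdomain and tld inside basedomain. Everything here is an assumption rather than a confirmed fix: it only covers dotted hostnames (the ip4, ip6 and nodots branches are left out), it treats a `.co.` or `.com.` label followed by a 2 to 5 letter label as part of the tld, and the group syntax is Python's `(?P<name>...)`, which would need to become `(?<name>...)` in Splunk. The names `COMBINED` and `split_domain` are hypothetical.

```python
import re
from typing import Dict, Optional

# Sample event copied from the first post.
SAMPLE = (
    "Aug 28 13:05:26 111.111.1.1 28-08-2012; 13:04:48, 26, 111.111.111.11, "
    "username@hostname, 125679, 1, text/html, "
    "http://global.ebsco-content.com/interfacefiles/12.4.33.0.2/javascript/"
    "bundled/_layout2/master.js, default, Educational"
)

COMBINED = re.compile(
    r"^(?:[^,]+, ){7}[^/]+://"           # skip 7 comma fields and the protocol
    r"(?P<fqdomain>(?:[\w-]+\.)*?"       # optional leading labels (lazy)
    r"(?P<basedomain>[\w-]+"             # the registrable label
    r"(?P<tld>\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})))"  # .com.au or .com
    r"(?=[:/, ]|$)"                      # host must end at a delimiter
)


def split_domain(event: str) -> Optional[Dict[str, str]]:
    """Return fqdomain/basedomain/tld for a dotted hostname, else None."""
    m = COMBINED.search(event)
    return m.groupdict() if m else None


print(split_domain(SAMPLE))
# Hypothetical variant of the sample line with a .com.au host:
print(split_domain(SAMPLE.replace("global.ebsco-content.com", "www.example.com.au")))
```

On the sample line this gives fqdomain `global.ebsco-content.com`, basedomain `ebsco-content.com` and tld `.com`, which matches the expected row above. The lookahead at the end is what stops the lazy prefix from settling on a partial host name.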
The following may not be the exact solution you need, but it may help you along a little bit. It uses the referer_domain field of access_combined logs.
First is the full search that I used, then a slightly more readable version (which may not run as-is due to the indentation), then a results table.
sourcetype="access_combined" | head 10000 | rex field=referer_domain "https?://(?:[\w-]*\.)*?(?<coming_from>((?<ip>[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|(?<nodots>[a-zA-Z0-9-_]+$)|(?<the_domain>([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))" | table referer_domain ip nodots the_domain coming_from | dedup referer_domain
Base search
sourcetype="access_combined" | head 10000 |
Start rexing, skip the protocol plus optional hostnames/subdomains
rex field=referer_domain "https?://(?:[\w-]*\.)*?
The field coming_from will get the final result
(?<coming_from>(
find the ip address if any (only ipv4 - add ipv6 if needed)
(?<ip>[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|
or a plain hostname
(?<nodots>[a-zA-Z0-9-_]+$)|
or find the domain (with an optional .co or .com as the second-to-last part), allowing for up to 5 characters in the top-level (change as needed)
(?<the_domain>([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))" |
output the result for testing
table referer_domain ip nodots the_domain coming_from | dedup referer_domain
Expected result:
referer_domain        ip        nodots    the_domain  coming_from
http://www.blah.com   <blank>   <blank>   blah.com    blah.com
https://1.2.3.4       1.2.3.4   <blank>   <blank>     1.2.3.4
http://my_srv         <blank>   my_srv    <blank>     my_srv
https://a.b.co.uk     <blank>   <blank>   b.co.uk     b.co.uk
http://as.df.jk.edu   <blank>   <blank>   jk.edu      jk.edu
I actually haven't tried the nodots function since I didn't have log data to test it on.
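For anyone without suitable log data, the pattern can be checked outside Splunk with a small Python port. This is a sketch under one assumption: the only change from the rex above is the group syntax, `(?P<name>...)` instead of `(?<name>...)`, because Python's `re` module does not accept the latter. The names `PATTERN` and `fields` are made up for the illustration, and the test URLs are the ones from the expected-result table, including the nodots case.

```python
import re
from typing import Dict, Optional

PATTERN = re.compile(
    r"https?://(?:[\w-]*\.)*?"           # protocol plus optional subdomains
    r"(?P<coming_from>("
    r"(?P<ip>[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|"   # IPv4
    r"(?P<nodots>[a-zA-Z0-9-_]+$)|"                               # plain hostname
    r"(?P<the_domain>([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))"
)

FIELDS = ("ip", "nodots", "the_domain", "coming_from")


def fields(url: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the four named groups (None where a branch did not match)."""
    m = PATTERN.search(url)
    return {name: m.group(name) for name in FIELDS} if m else None


for url in ("http://www.blah.com", "https://1.2.3.4", "http://my_srv",
            "https://a.b.co.uk", "http://as.df.jk.edu"):
    print(url, fields(url))
```

Because every branch ends in `$`, this only works when the field holds just the protocol and host, as referer_domain does; a trailing path or port would stop it from matching.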
Hope this helps,
Kristian
Kristian, thanks for the in depth answer, very much appreciate it.
I have it all running in the search app using rex like you explained; however, my issue is making an "all-in-one" regex that finds the right field in the csv, then runs the "domain" regex on it...
I think this is what I am after:
://(?:[\w-]*\.)*?(?<domainname>(?<ip>^[A-Fa-f\d\.:]+$)|(?<nodots>^[^\.]+$)|(?<fqdomain>(?:(?:[^\.]+\.)?(?<tld>((?:[^\.\s]{3})|(?:[^\.\s]{2}))(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+)))))$)
I'm away from office though at the moment, I will let you know when I'm back.
updated a typo
Please help guys, I am really at a loss here 😞
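One thing that may be worth checking in the combined pattern posted above (the one starting `://(?:[\w-]*\.)*?(?<domainname>`): the ip and nodots branches still begin with `^`. Once `://` has been consumed, the regex is no longer at the start of the string, so in default (non-multiline) mode those branches can never match. The sketch below shows the effect in Python on a cut-down version of the ip branch; it assumes Splunk's regex engine treats `^` the same way, and the names `WITH_CARET`, `WITHOUT_CARET` and `try_match` are hypothetical.

```python
import re
from typing import Optional, Pattern

URL = "https://1.2.3.4"

# ip branch as posted: '^' is asserted after '://' has already been matched.
WITH_CARET = re.compile(r"://(?P<ip>^[A-Fa-f\d.:]+$)")
# Same branch with the leading '^' removed.
WITHOUT_CARET = re.compile(r"://(?P<ip>[A-Fa-f\d.:]+$)")


def try_match(pattern: Pattern[str], text: str) -> Optional[str]:
    """Return the ip group if the pattern matches anywhere in text."""
    m = pattern.search(text)
    return m.group("ip") if m else None


print(try_match(WITH_CARET, URL))     # None
print(try_match(WITHOUT_CARET, URL))  # 1.2.3.4
```

The trailing `$` has a similar problem against the raw csv line: in the sample data the host is followed by `/interfacefiles/...`, so a branch that demands end-of-string right after the host will not match there either. Replacing `$` with a delimiter check such as `[:/, ]` (as in the ip4 and ip6 branches of the earlier attempt) is one way around that.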