Splunk Search

How to write the regex to extract the domains from URLs?

ccsfdave
Builder

I have been through the field extractor, answers.splunk.com, and the interwebs looking for help on this one. Our Palo Alto firewall gives us the URLs of sites visited. Here is a sample:

crl.microsoft.com/pki/crl/products/MicRooCerAut2011_2011_03_22.crl
safebrowsing-cache.google.com/
p4-a2lp5grl52xoy-qpo2s4ky6vs36rpb-794312-s1-v6exp3-v4.metric.gstatic.com/
de.tynt.com/deb/v2?id=dZxfWCGner46jsacwqm_6l&r=lyricstranslate.com/en/l039amour-c039est-pour-rien-love-nothing.html
a248.e.akamai.net/

I would like to be able to extract the domains e.g.

microsoft or microsoft.com
google or google.com
gstatic or gstatic.com
tynt or tynt.com
akamai or akamai.net

I would think the way to go about it is to look for the FIRST .com, .net, .org, etc. and then work back to the previous "." to grab the domain, but that is beyond me.

Can anyone help?

1 Solution

somesoni2
Revered Legend

Try this run-anywhere sample:

| gentimes start=-1 | eval URL="crl.microsoft.com/pki/crl/products/MicRooCerAut2011_2011_03_22.crl safebrowsing-cache.google.com/ p4-a2lp5grl52xoy-qpo2s4ky6vs36rpb-794312-s1-v6exp3-v4.metric.gstatic.com/ de.tynt.com/deb/v2?id=dZxfWCGner46jsacwqm_6l&r=lyricstranslate.com/en/l039amour-c039est-pour-rien-love-nothing.html a248.e.akamai.net/" | table URL | makemv URL | mvexpand URL | rex field=URL "(?<domain>\w+\.\w+)\/"
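To see what the rex pattern picks out, here is a rough Python sketch that runs the same expression over the sample URLs from the question. It assumes Python's re module and Splunk's PCRE agree on these simple constructs; the one visible difference is that Python spells the named group (?P<domain>...) rather than (?<domain>...). The helper name extract_domain is made up for illustration.

```python
import re

# Same pattern as the rex above, with Python's named-group spelling.
# It matches the first "word.word" that sits directly before a "/".
DOMAIN_RE = re.compile(r"(?P<domain>\w+\.\w+)/")

SAMPLE_URLS = [
    "crl.microsoft.com/pki/crl/products/MicRooCerAut2011_2011_03_22.crl",
    "safebrowsing-cache.google.com/",
    "p4-a2lp5grl52xoy-qpo2s4ky6vs36rpb-794312-s1-v6exp3-v4.metric.gstatic.com/",
    "de.tynt.com/deb/v2?id=dZxfWCGner46jsacwqm_6l&r=lyricstranslate.com/en/l039amour-c039est-pour-rien-love-nothing.html",
    "a248.e.akamai.net/",
]


def extract_domain(url):
    """Return the first 'label.label' immediately followed by '/', or None."""
    match = DOMAIN_RE.search(url)
    return match.group("domain") if match else None


if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(extract_domain(url))
```

On the five samples this yields microsoft.com, google.com, gstatic.com, tynt.com and akamai.net. Two limits follow from the pattern itself: \w does not match a hyphen, so a hypothetical my-site.com/ comes back as site.com, and a URL with no "/" at all (for example a bare example.com) produces no match.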

ccsfdave
Builder

@somesoni2

You have it, but help me understand it so that I may apply it to my search. As @Rhin0Crash stated the Palo Altos see the field as "url" so my base search is: index=pan_logs sourcetype=pan* src_ip=x.x.x.x url=*

Rhin0Crash
Path Finder

@ccsfdave :

index=pan_logs sourcetype=pan* src_ip=x.x.x.x url=* | rex field=url "(?<domain>\w+\.\w+)\/" | table domain _raw

ccsfdave
Builder

Yup you got it!

| rex field=url "(?<domain>\w+\.\w+)\/"

Rhin0Crash
Path Finder
search | rex field=_raw "(?<domain>\w+)\.(com|net|gov|edu|co)"

I think.

Replace _raw with whichever field the PA gives you for the URL. That might be URL, or misc, or uri.
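As a rough check of the TLD-list idea, the Python sketch below runs that pattern over the sample URLs from the question. It assumes Python's re module behaves like Splunk's PCRE for these constructs, apart from the (?P<domain>...) group spelling. STRICT_TLD_RE and first_domain are hypothetical names added for illustration, and www.company.org/ is a made-up URL, not one from the logs.

```python
import re

# Pattern from the post above, with Python's named-group spelling.
TLD_RE = re.compile(r"(?P<domain>\w+)\.(com|net|gov|edu|co)")

# Hypothetical stricter variant: the TLD must be followed by "/", ":" or the
# end of the string, so "com" cannot match the start of a longer label.
STRICT_TLD_RE = re.compile(r"(?P<domain>\w+)\.(?:com|net|gov|edu|co)(?=[/:]|$)")

SAMPLE_URLS = [
    "crl.microsoft.com/pki/crl/products/MicRooCerAut2011_2011_03_22.crl",
    "safebrowsing-cache.google.com/",
    "p4-a2lp5grl52xoy-qpo2s4ky6vs36rpb-794312-s1-v6exp3-v4.metric.gstatic.com/",
    "de.tynt.com/deb/v2?id=dZxfWCGner46jsacwqm_6l&r=lyricstranslate.com/en/l039amour-c039est-pour-rien-love-nothing.html",
    "a248.e.akamai.net/",
]


def first_domain(pattern, url):
    """Return the 'domain' group of the leftmost match, or None."""
    match = pattern.search(url)
    return match.group("domain") if match else None


if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(first_domain(TLD_RE, url), first_domain(STRICT_TLD_RE, url))
    # The loose pattern stops at "www" because "company" starts with "com".
    print(first_domain(TLD_RE, "www.company.org/"))
    # The strict pattern finds nothing, since "org" is not in the TLD list.
    print(first_domain(STRICT_TLD_RE, "www.company.org/"))
```

Both patterns return the bare names (microsoft, google, gstatic, tynt, akamai) for the samples. The made-up www.company.org/ case shows the trade-off: without a boundary after the TLD the loose pattern can latch onto a label that merely begins with "com" or "co", and with either pattern any TLD missing from the list is never matched.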