<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Regex - fqdn, basedomain etc. from URL in CK log in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/Regex-fqdn-basedomain-etc-from-URL-in-CK-log/m-p/45179#M10689</link>
    <description>&lt;P&gt;The following may not be the exact solution you need, but it may help you along a little bit. It is using the &lt;CODE&gt;referer_domain&lt;/CODE&gt; field of &lt;CODE&gt;access_combined&lt;/CODE&gt; logs.&lt;/P&gt;

&lt;P&gt;First is the full search that I used, then a slightly more readable version which may not run due to indentation, then a results table.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;sourcetype="access_combined" | head 10000 | rex field=referer_domain "https?://(?:[\w-]*\.)*?(?&amp;lt;coming_from&amp;gt;((?&amp;lt;ip&amp;gt;[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|(?&amp;lt;nodots&amp;gt;[a-zA-Z0-9-_]+$)|(?&amp;lt;the_domain&amp;gt;([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))" | table referer_domain ip nodots the_domain coming_from | dedup referer_domain
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Base search&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;sourcetype="access_combined" | head 10000 | 
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Start rexing, skip the protocol plus optional hostnames/subdomains&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;rex field=referer_domain "https?://(?:[\w-]*\.)*?
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The field &lt;CODE&gt;coming_from&lt;/CODE&gt; will get the final result&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(?&amp;lt;coming_from&amp;gt;(
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;find the ip address if any (only ipv4 - add ipv6 if needed)&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(?&amp;lt;ip&amp;gt;[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;or a plain hostname&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(?&amp;lt;nodots&amp;gt;[a-zA-Z0-9-_]+$)|
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;or find the domain (with an optional &lt;CODE&gt;.co&lt;/CODE&gt; or &lt;CODE&gt;.com&lt;/CODE&gt; as the second to last part) allowing for up to 5 characters in the top-level (change as needed)&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(?&amp;lt;the_domain&amp;gt;([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))" | 
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;output the result for testing&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;table referer_domain ip nodots the_domain coming_from | dedup referer_domain
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Expected result:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;referer_domain       ip      nodots   the_domain    coming_from
&lt;A href="http://www.blah.com" target="test_blank"&gt;http://www.blah.com&lt;/A&gt;                   blah.com      blah.com
&lt;A href="https://1.2.3.4" target="test_blank"&gt;https://1.2.3.4&lt;/A&gt;      1.2.3.4                        1.2.3.4
&lt;A href="http://my_srv" target="test_blank"&gt;http://my_srv&lt;/A&gt;                my_srv                 my_srv
&lt;A href="https://a.b.co.uk" target="test_blank"&gt;https://a.b.co.uk&lt;/A&gt;                     b.co.uk       b.co.uk
&lt;A href="http://as.df.jk.edu" target="test_blank"&gt;http://as.df.jk.edu&lt;/A&gt;                   jk.edu        jk.edu
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;I haven't actually tested the &lt;CODE&gt;nodots&lt;/CODE&gt; branch, since I didn't have log data to test it on.&lt;/P&gt;

&lt;P&gt;Hope this helps,&lt;/P&gt;

&lt;P&gt;Kristian&lt;/P&gt;</description>
    <pubDate>Tue, 28 Aug 2012 09:09:53 GMT</pubDate>
    <dc:creator>kristian_kolb</dc:creator>
    <dc:date>2012-08-28T09:09:53Z</dc:date>
    <item>
      <title>Regex - fqdn, basedomain etc. from URL in CK log</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Regex-fqdn-basedomain-etc-from-URL-in-CK-log/m-p/45177#M10687</link>
      <description>&lt;P&gt;Hi there,&lt;/P&gt;

&lt;P&gt;I have taken the following regex from here...&lt;/P&gt;

&lt;P&gt;&lt;A href="http://splunk-base.splunk.com/answers/9736/revisiting-regex-to-extract-domain-name-from-an-fqdn/10407"&gt;http://splunk-base.splunk.com/answers/9736/revisiting-regex-to-extract-domain-name-from-an-fqdn/10407&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;And modified it to suit domains such as .com.au, leaving it like:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(?&amp;lt;domainname&amp;gt;(?&amp;lt;ip&amp;gt;^[A-Fa-f\d\.:]+$)|(?&amp;lt;nodots&amp;gt;^[^\.]+$)|(?&amp;lt;fqdomain&amp;gt;(?:(?:[^\.]+\.)?(?&amp;lt;tld&amp;gt;((?:[^\.\s]{3})|(?:[^\.\s]{2}))(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+)))))$)
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Now, I have data formatted in csv style containing a url string...&lt;BR /&gt;
To extract the domain/ip string from the data, I use this regex:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(?i)^(?:[^ ]* ){12}.+://(?P&amp;lt;domain&amp;gt;[^:|,|/]+)[/,]?
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;What I wish to do is create a single regex that will create the domainname, nodots, fqdomain and tld fields from the data extracted using the second extraction of domain.&lt;BR /&gt;
Can someone please help me combine the two extractions into a single one?&lt;BR /&gt;
I'm not the best when it comes to splunk regex...&lt;/P&gt;

&lt;P&gt;Here is some sample data:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;Aug 28 13:05:26 111.111.1.1 28-08-2012; 13:04:48, 26, 111.111.111.11, username@hostname, 125679, 1, text/html, &lt;A href="http://global.ebsco-content.com/interfacefiles/12.4.33.0.2/javascript/bundled/_layout2/master.js" target="test_blank"&gt;http://global.ebsco-content.com/interfacefiles/12.4.33.0.2/javascript/bundled/_layout2/master.js&lt;/A&gt;, default, Educational
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 28 Aug 2012 03:15:21 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Regex-fqdn-basedomain-etc-from-URL-in-CK-log/m-p/45177#M10687</guid>
      <dc:creator>aaronnicoli</dc:creator>
      <dc:date>2012-08-28T03:15:21Z</dc:date>
    </item>
    <item>
      <title>Re: Regex - fqdn, basedomain etc. from URL in CK log</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Regex-fqdn-basedomain-etc-from-URL-in-CK-log/m-p/45178#M10688</link>
      <description>&lt;P&gt;Please help guys, I am really at a loss here &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 28 Aug 2012 05:25:39 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Regex-fqdn-basedomain-etc-from-URL-in-CK-log/m-p/45178#M10688</guid>
      <dc:creator>aaronnicoli</dc:creator>
      <dc:date>2012-08-28T05:25:39Z</dc:date>
    </item>
    <item>
      <title>Re: Regex - fqdn, basedomain etc. from URL in CK log</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Regex-fqdn-basedomain-etc-from-URL-in-CK-log/m-p/45179#M10689</link>
      <description>&lt;P&gt;The following may not be the exact solution you need, but it may help you along a little bit. It is using the &lt;CODE&gt;referer_domain&lt;/CODE&gt; field of &lt;CODE&gt;access_combined&lt;/CODE&gt; logs.&lt;/P&gt;

&lt;P&gt;First is the full search that I used, then a slightly more readable version which may not run due to indentation, then a results table.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;sourcetype="access_combined" | head 10000 | rex field=referer_domain "https?://(?:[\w-]*\.)*?(?&amp;lt;coming_from&amp;gt;((?&amp;lt;ip&amp;gt;[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|(?&amp;lt;nodots&amp;gt;[a-zA-Z0-9-_]+$)|(?&amp;lt;the_domain&amp;gt;([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))" | table referer_domain ip nodots the_domain coming_from | dedup referer_domain
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Base search&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;sourcetype="access_combined" | head 10000 | 
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Start rexing, skip the protocol plus optional hostnames/subdomains&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;rex field=referer_domain "https?://(?:[\w-]*\.)*?
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The field &lt;CODE&gt;coming_from&lt;/CODE&gt; will get the final result&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(?&amp;lt;coming_from&amp;gt;(
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;find the ip address if any (only ipv4 - add ipv6 if needed)&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(?&amp;lt;ip&amp;gt;[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$)|
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;or a plain hostname&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(?&amp;lt;nodots&amp;gt;[a-zA-Z0-9-_]+$)|
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;or find the domain (with an optional &lt;CODE&gt;.co&lt;/CODE&gt; or &lt;CODE&gt;.com&lt;/CODE&gt; as the second to last part) allowing for up to 5 characters in the top-level (change as needed)&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;(?&amp;lt;the_domain&amp;gt;([\w-]+(\.com?\.[a-zA-Z]{2,5}|\.[a-zA-Z]{2,5})$))))" | 
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;output the result for testing&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;table referer_domain ip nodots the_domain coming_from | dedup referer_domain
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Expected result:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;referer_domain       ip      nodots   the_domain    coming_from
&lt;A href="http://www.blah.com" target="test_blank"&gt;http://www.blah.com&lt;/A&gt;                   blah.com      blah.com
&lt;A href="https://1.2.3.4" target="test_blank"&gt;https://1.2.3.4&lt;/A&gt;      1.2.3.4                        1.2.3.4
&lt;A href="http://my_srv" target="test_blank"&gt;http://my_srv&lt;/A&gt;                my_srv                 my_srv
&lt;A href="https://a.b.co.uk" target="test_blank"&gt;https://a.b.co.uk&lt;/A&gt;                     b.co.uk       b.co.uk
&lt;A href="http://as.df.jk.edu" target="test_blank"&gt;http://as.df.jk.edu&lt;/A&gt;                   jk.edu        jk.edu
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;I haven't actually tested the &lt;CODE&gt;nodots&lt;/CODE&gt; branch, since I didn't have log data to test it on.&lt;/P&gt;
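
&lt;P&gt;If the URL is embedded in a longer event, as in the sample data from the question, one option is to chain two &lt;CODE&gt;rex&lt;/CODE&gt; commands instead of building a single all-in-one regex: the first pulls the host out of the raw event, the second splits that host. This is an untested sketch, not a definitive solution. &lt;CODE&gt;base-search&lt;/CODE&gt; is a placeholder, the field names &lt;CODE&gt;fqdomain&lt;/CODE&gt;, &lt;CODE&gt;basedomain&lt;/CODE&gt; and &lt;CODE&gt;tld&lt;/CODE&gt; are borrowed from the question, and it assumes the URL is always the eighth comma-separated value and that IPv6 hosts do not need to be handled.&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;base-search
| rex "^(?:[^,]+, ){7}[^/,]+://(?&amp;lt;fqdomain&amp;gt;[^/:, ]+)"
| rex field=fqdomain "(?&amp;lt;basedomain&amp;gt;[^.]+(?&amp;lt;tld&amp;gt;\.(?:com?\.[a-zA-Z]{2,5}|[a-zA-Z]{2,5})))$"
| table fqdomain basedomain tld
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;For the sample event this should give &lt;CODE&gt;global.ebsco-content.com&lt;/CODE&gt;, &lt;CODE&gt;ebsco-content.com&lt;/CODE&gt; and &lt;CODE&gt;.com&lt;/CODE&gt;. For an IPv4 address or a hostname without dots the second &lt;CODE&gt;rex&lt;/CODE&gt; should not match, so &lt;CODE&gt;basedomain&lt;/CODE&gt; and &lt;CODE&gt;tld&lt;/CODE&gt; stay empty while &lt;CODE&gt;fqdomain&lt;/CODE&gt; still holds the host.&lt;/P&gt;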

&lt;P&gt;Hope this helps,&lt;/P&gt;

&lt;P&gt;Kristian&lt;/P&gt;</description>
      <pubDate>Tue, 28 Aug 2012 09:09:53 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Regex-fqdn-basedomain-etc-from-URL-in-CK-log/m-p/45179#M10689</guid>
      <dc:creator>kristian_kolb</dc:creator>
      <dc:date>2012-08-28T09:09:53Z</dc:date>
    </item>
    <item>
      <title>Re: Regex - fqdn, basedomain etc. from URL in CK log</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Regex-fqdn-basedomain-etc-from-URL-in-CK-log/m-p/45180#M10690</link>
      <description>&lt;P&gt;updated a typo&lt;/P&gt;</description>
      <pubDate>Tue, 28 Aug 2012 09:11:34 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Regex-fqdn-basedomain-etc-from-URL-in-CK-log/m-p/45180#M10690</guid>
      <dc:creator>kristian_kolb</dc:creator>
      <dc:date>2012-08-28T09:11:34Z</dc:date>
    </item>
    <item>
      <title>Re: Regex - fqdn, basedomain etc. from URL in CK log</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Regex-fqdn-basedomain-etc-from-URL-in-CK-log/m-p/45181#M10691</link>
      <description>&lt;P&gt;Kristian, thanks for the in depth answer, very much appreciate it.&lt;/P&gt;

&lt;P&gt;I have it all running in the search app using rex like you explained; however, my issue is making an "all-in-one" regex that finds the right field in the CSV, then runs the "domain" regex on it...&lt;/P&gt;

&lt;P&gt;I think this is what I am after:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;://(?:[\w-]*\.)*?(?&amp;lt;domainname&amp;gt;(?&amp;lt;ip&amp;gt;^[A-Fa-f\d\.:]+$)|(?&amp;lt;nodots&amp;gt;^[^\.]+$)|(?&amp;lt;fqdomain&amp;gt;(?:(?:[^\.]+\.)?(?&amp;lt;tld&amp;gt;((?:[^\.\s]{3})|(?:[^\.\s]{2}))(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+)))))$)
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;I'm away from office though at the moment, I will let you know when I'm back.&lt;/P&gt;</description>
      <pubDate>Tue, 28 Aug 2012 14:07:33 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Regex-fqdn-basedomain-etc-from-URL-in-CK-log/m-p/45181#M10691</guid>
      <dc:creator>aaronnicoli</dc:creator>
      <dc:date>2012-08-28T14:07:33Z</dc:date>
    </item>
    <item>
      <title>Re: Regex - fqdn, basedomain etc. from URL in CK log</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Regex-fqdn-basedomain-etc-from-URL-in-CK-log/m-p/45182#M10692</link>
      <description>&lt;P&gt;Okay so...&lt;/P&gt;

&lt;P&gt;Didn't have much luck with the previous response sorry...&lt;/P&gt;

&lt;P&gt;I have been fiddling and seeing what I can come up with, this is what I now have:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;([^,]+, ){7}[^/]+://(?&amp;lt;basedomain&amp;gt;(\[(?&amp;lt;ip6&amp;gt;[^\]]+)\][:/, ])|((?&amp;lt;ip4&amp;gt;\d+(\.\d+){3})[:/, ])|((?&amp;lt;nodots&amp;gt;[^\.,/: ]+)[:,/ ])|(?&amp;lt;fqdomain&amp;gt;(?:(?:[^\.]+\.)?(?&amp;lt;tld&amp;gt;((?:[^\.\s]{3})|(?:[^\.\s]{2}))(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+))))))
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;I understand basedomain is broken, but so are fqdomain and tld, which for the life of me I can't get to work. (ip4 and ip6, as well as nodots, work nicely though.)&lt;/P&gt;

&lt;P&gt;From the sample data in my first post, this is what I expect to see when the following search is run:&lt;/P&gt;

&lt;P&gt;base-search |table basedomain ip4 fqdomain tld&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;ebsco-content.com &amp;lt;blank&amp;gt; global.ebsco-content.com .com
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;I need some regex experts to help me on this...&lt;/P&gt;

&lt;P&gt;Thanks in advance, Aaron.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Aug 2012 01:28:37 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Regex-fqdn-basedomain-etc-from-URL-in-CK-log/m-p/45182#M10692</guid>
      <dc:creator>aaronnicoli</dc:creator>
      <dc:date>2012-08-29T01:28:37Z</dc:date>
    </item>
  </channel>
</rss>

