URL Toolbox not correctly parsing some TLDs

davelugo
New Member

Hi,

I'm using the Dec 17, 2015 version of URL Toolbox on Splunk 6.3.2. It's installed and working, but... not very well with some TLDs that have multiple levels.

Here's one example. rev holds the rDNS of a connecting IP. I want to reduce all the rDNS patterns to the list of domains.

index = our_index rev="*mx" | fields rev | eval lrev = lower(rev) | eval rev = lrev | dedup rev | eval list="mozilla" | lookup ut_parse_extended_lookup url AS rev| search ut_domain="net.mx"

ut_domain in the example below should be megared.net.mx, yes?

Field                  Value
lrev                   customer-gdl-168-245.megared.net.mx
rev                    customer-gdl-168-245.megared.net.mx
ut_domain              net.mx
ut_domain_without_tld  net
ut_fragment            None
ut_netloc              customer-gdl-168-245.megared.net.mx
ut_params              None
ut_path                None
ut_port                80
ut_query               None
ut_scheme              None
ut_subdomain           customer-gdl-168-245.megared
ut_subdomain_count     2
ut_subdomain_parts     {"ut_subdomain_level_1": "megared", "ut_subdomain_level_2": "customer-gdl-168-245"}
ut_tld                 mx
_time                  2016-04-13T18:05:36.884+00:00

In contrast, this one worked fine, for rev=ale.nubehost.mx, ut_domain=nubehost.mx

Field                  Value
lrev                   ale.nubehost.mx
rev                    ale.nubehost.mx
ut_domain              nubehost.mx
ut_domain_without_tld  nubehost
ut_fragment            None
ut_netloc              ale.nubehost.mx
ut_params              None
ut_path                None
ut_port                80
ut_query               None
ut_scheme              None
ut_subdomain           ale
ut_subdomain_count     1
ut_subdomain_parts     {"ut_subdomain_level_1": "ale"}
ut_tld                 mx
_time                  2016-04-13T18:08:56.336+00:00

Any ideas why this is happening? URL Toolbox would be ideal for what I need, if the answers it was giving me made a little more sense 🙂

Thanks,

Dave


cleroux_splunk
Splunk Employee

xxx.xxx.xxx.xx.cable.dyn.cableonline.com.mx
* Mozilla List: TLD is .com.mx, so the domain is cableonline.com.mx
* IANA List: TLD is .mx, so the domain is com.mx
This works as expected.

What you are asking for is that the tool take the longest match across both lists, which would be a new feature. Thanks for the idea 🙂
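
To make the difference between the two lists concrete, here is a minimal Python sketch of longest-suffix matching. It is not URL Toolbox's actual code: the function name split_domain and the two trimmed-down suffix sets are illustrative assumptions, with entries taken from the lists quoted in this thread.

```python
def split_domain(fqdn, suffixes):
    """Return (domain, tld) using the longest suffix in `suffixes` that matches."""
    labels = fqdn.lower().split(".")
    # Try the longest candidate suffix first; a suffix must leave at least one label.
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return ".".join(labels[i - 1:]), candidate
    return fqdn, None  # no suffix matched


# Hypothetical, heavily trimmed stand-ins for the two lists.
MOZILLA = {"mx", "com.mx", "net.mx", "org.mx", "gob.mx", "edu.mx"}
IANA = {"mx"}

host = "cable.dyn.cableonline.com.mx"
print(split_domain(host, MOZILLA))  # ('cableonline.com.mx', 'com.mx')
print(split_domain(host, IANA))     # ('com.mx', 'mx')
```

With the Mozilla-style set the longest matching suffix is com.mx, so the domain keeps three labels; with the IANA-style set only mx matches, so the domain collapses to com.mx.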


davelugo
New Member

Happy to help make the tool better 🙂

I figured out how to do what I needed sans the app, using a lookup table and a creative Splunk search pipeline. Waiting for it to be approved by a moderator...


davelugo
New Member

I got TLD lookups working, but sans URL Toolbox.

Here's what I did, in case anyone else is interested:

1) Construct a lookup table from a suffix list of your choice.

I used the Mozilla list. Looking it over, I saw that the maximum number of elements in a TLD was four. I had to construct the lookup table in the correct order so that, for example, FQDNs in com.mx would match that TLD before the shorter mx TLD. I also noticed that a few TLDs themselves had MX records, and so could conceivably show up in logs without any other prepended element.

Thus, the strategy of adding entries in the order of:

*.four.three.two.one
four.three.two.one
*.three.two.one
three.two.one
*.two.one
two.one
*.one
one
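
Why the order matters can be shown with a small Python sketch. It only loosely models a first-match wildcard lookup (it is an assumption that Splunk's WILDCARD matching with max_matches = 1 behaves like this); the helper first_match and the sample rows are hypothetical.

```python
from fnmatch import fnmatchcase


def first_match(fqdn, rows):
    """Return the levels value of the first row whose pattern matches, else 0."""
    for pattern, levels in rows:
        if fnmatchcase(fqdn, pattern):
            return levels
    return 0


# Longest suffixes first, as in the ordering above.
good_order = [("com.mx", 0), ("*.com.mx", 2), ("mx", 0), ("*.mx", 1)]
# The same rows with the shorter suffix first.
bad_order = [("mx", 0), ("*.mx", 1), ("com.mx", 0), ("*.com.mx", 2)]

print(first_match("foo.company.com.mx", good_order))  # 2
print(first_match("foo.company.com.mx", bad_order))   # 1
```

With the short suffix first, *.mx wins and the name is treated as having a one-element TLD, which is the same symptom as ut_domain coming out as com.mx.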

I also grepped out the entries containing non-ASCII (8-bit) characters, at least in this iteration.

This is the script I came up with to generate the lookup table:

#!/bin/bash
# Build a Splunk lookup table (rev,levels) from the Mozilla suffix list.
# Longer suffixes are emitted first so the most specific one matches first.

TLDFILE=suffix_list_mozilla.dat

echo rev,levels

# emit_level PATTERN LEVELS: for each ASCII-only suffix matching PATTERN,
# print an exact entry (levels 0) and a wildcard entry (*.suffix,LEVELS).
emit_level() {
  grep -v -P '[^\11\12\40-\176]' "$TLDFILE" | grep -E "$1" | while read -r DOMAIN
  do
    echo "$DOMAIN,0"
    echo "*.$DOMAIN,$2"
  done
}

emit_level '^[a-z-]*\.[a-z-]*\.[a-z-]*\.[a-z-]*$' 4   # get 4-level tlds
emit_level '^[a-z-]*\.[a-z-]*\.[a-z-]*$' 3            # get 3-level tlds
emit_level '^[a-z-]*\.[a-z-]*$' 2                     # get 2-level tlds
emit_level '^[a-z-]{2,}$' 1                           # get 1-level tlds

2) Add the lookup table to Splunk via the GUI.

Upload the lookup table file produced by the previous step.

(I saved mine as suffix_list.csv)

Add a lookup definition, making sure to set default match to 0, max matches to 1, and min matches to 0.

(I added it as tld_suffix)

3) Enable wildcards for the first field. Edit transforms.conf, adding the WILDCARD line as in the example below:

[tld_suffix]
default_match = 0
filename = suffix_list.csv
max_matches = 1
min_matches = 0
match_type = WILDCARD(rev)

4) Restart splunkd

So it sees the updated transforms.conf

5) Run the search.

I added a couple rex sed commands to take care of some anomalies I was seeing in rDNS (*.somefqdn.com, and something_other.otherfqdn.com). Since both are illegal as part of a domain name (latter legit only for SRV records, I think), this seemed safe.
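
As a rough illustration of that cleanup outside Splunk, here is a small Python sketch. The function name normalize is hypothetical; it lowercases the name and then applies the same two substitutions as the rex sed commands in the search (drop "*." and turn "_" into "-").

```python
import re


def normalize(rev):
    """Lowercase, drop any '*.' sequences, and turn underscores into hyphens."""
    fqdn = rev.lower()
    fqdn = re.sub(r"\*\.", "", fqdn)
    fqdn = fqdn.replace("_", "-")
    return fqdn


print(normalize("*.SomeFQDN.com"))                 # somefqdn.com
print(normalize("something_other.otherfqdn.com"))  # something-other.otherfqdn.com
```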

Since 2-element TLDs are the most common in our logs, followed by 1-element, then 3-element, then 4-element TLDs, the 'eval dn=' branches in the search are ordered the same way.

If there is no match found in the lookup table, the search preserves the original FQDN. I may change this later for my own purposes here, ymmv.

The bits in the middle are what do the work, the rest is wrapper to illustrate an example search.

index = our_index rev="*"
| fields rev 

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
    | eval fqdn = lower(rev)
    | rex field=fqdn mode=sed "s/\*\.//g"
    | rex field=fqdn mode=sed "s/_/-/g"
    | lookup tld_suffix rev AS fqdn
    | eval dn=if(levels == 2, replace(fqdn, "^([a-z0-9-\.]+\.)([a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+$)", "\2"),
              if(levels == 1, replace(fqdn, "^([a-z0-9-\.]+\.)([a-z0-9-]+\.[a-z0-9-]+$)", "\2"),
              if(levels == 3, replace(fqdn, "^([a-z0-9-\.]+\.)([a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+$)", "\2"),
              if(levels == 4, replace(fqdn, "^([a-z0-9-\.]+\.)([a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+$)", "\2"),
              fqdn))))
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

| stats count by levels,fqdn,dn
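
The idea behind the 'eval dn=' step can also be sketched in Python: keep the last levels + 1 labels of the name. The function reduce_to_domain is a hypothetical stand-in, not part of the search; it follows the stated intent that names with no lookup match (levels 0) keep the original FQDN, and unlike the regexes it does not check which characters the labels contain.

```python
def reduce_to_domain(fqdn, levels):
    """Keep the last levels + 1 labels; return fqdn unchanged when levels is 0."""
    labels = fqdn.split(".")
    keep = levels + 1  # the TLD elements plus one label for the registered name
    if levels == 0 or len(labels) <= keep:
        return fqdn
    return ".".join(labels[-keep:])


print(reduce_to_domain("customer-gdl-168-245.megared.net.mx", 2))  # megared.net.mx
print(reduce_to_domain("ale.nubehost.mx", 1))                      # nubehost.mx
print(reduce_to_domain("net.mx", 0))                               # net.mx
```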

6) Enjoy your new "reduce FQDN to domain" tool 🙂 Adjust

This does exactly what I wanted. I thought URL Toolbox would too, but maybe something odd in the recent version, or Python, or some other reason prevented it from doing that.


cleroux_splunk
Splunk Employee

As described by the Mozilla Suffix List, .net.mx is a TLD, as are others like .com.mx and .org.mx.

If you only want the top-level part (.mx) treated as the TLD, set the list to iana, or to custom if you have your own custom list.

davelugo
New Member

Yes, but... when using the Mozilla list, shouldn't URL Toolbox be smart enough to give a ut_domain of company.com.mx for an FQDN of foo.company.com.mx, and a ut_domain of nubehost.mx for an FQDN of ale.nubehost.mx?

Here's another example, using iana as you suggest.

index = our_index rev="*mx" | fields rev | eval lrev = lower(rev) | eval rev = lrev | dedup rev | eval list="iana" | lookup ut_parse_extended_lookup url AS rev

rev is xxx.xxx.xxx.xx.cable.dyn.cableonline.com.mx

Shouldn't ut_domain be cableonline.com.mx? Instead it's com.mx again.

Field                  Value
lrev                   xxx.xxx.xxx.xx.cable.dyn.cableonline.com.mx
rev                    xxx.xxx.xxx.xx.cable.dyn.cableonline.com.mx
ut_domain              com.mx
ut_domain_without_tld  com
ut_fragment            None
ut_netloc              xxx.xxx.xxx.xx.cable.dyn.cableonline.com.mx
ut_params              None
ut_path                None
ut_port                80
ut_query               None
ut_scheme              None
ut_subdomain           xxx.xxx.xxx.xx8.cable.dyn.cableonline
ut_subdomain_count     7
ut_subdomain_parts     {"ut_subdomain_level_4": "58", "ut_subdomain_level_5": "106", "ut_subdomain_level_6": "239", "ut_subdomain_level_7": "177", "ut_subdomain_level_1": "cableonline", "ut_subdomain_level_2": "dyn", "ut_subdomain_level_3": "cable"}
ut_tld                 mx
_time                  2016-04-13T19:25:19.294+00:00

From the Mozilla list:

mx
com.mx
org.mx
gob.mx
edu.mx
net.mx
blogspot.mx

From the iana list:

MX

My apologies if I'm misunderstanding something.
