Hi,
I'm using the Dec 17, 2015 version of URL Toolbox on Splunk 6.3.2. It's installed and working, but... not very well with some TLDs that have multiple levels.
Here's one example. rev holds the rDNS of a connecting IP. I want to reduce all the rDNS patterns to the list of domains.
index = our_index rev="*mx" | fields rev | eval lrev = lower(rev) | eval rev = lrev | dedup rev | eval list="mozilla" | lookup ut_parse_extended_lookup url AS rev| search ut_domain="net.mx"
ut_domain in the example below should be megared.net.mx, yes?
lrev = customer-gdl-168-245.megared.net.mx
rev = customer-gdl-168-245.megared.net.mx
ut_domain = net.mx
ut_domain_without_tld = net
ut_fragment = None
ut_netloc = customer-gdl-168-245.megared.net.mx
ut_params = None
ut_path = None
ut_port = 80
ut_query = None
ut_scheme = None
ut_subdomain = customer-gdl-168-245.megared
ut_subdomain_count = 2
ut_subdomain_parts = {"ut_subdomain_level_1": "megared", "ut_subdomain_level_2": "customer-gdl-168-245"}
ut_tld = mx
_time = 2016-04-13T18:05:36.884+00:00
In contrast, this one worked fine, for rev=ale.nubehost.mx, ut_domain=nubehost.mx
lrev = ale.nubehost.mx
rev = ale.nubehost.mx
ut_domain = nubehost.mx
ut_domain_without_tld = nubehost
ut_fragment = None
ut_netloc = ale.nubehost.mx
ut_params = None
ut_path = None
ut_port = 80
ut_query = None
ut_scheme = None
ut_subdomain = ale
ut_subdomain_count = 1
ut_subdomain_parts = {"ut_subdomain_level_1": "ale"}
ut_tld = mx
_time = 2016-04-13T18:08:56.336+00:00
Any ideas why this is happening? URL Toolbox would be ideal for what I need, if the answers it was giving me made a little more sense 🙂
Thanks,
Dave
xxx.xxx.xxx.xx.cable.dyn.cableonline.com.mx
* Mozilla List: TLD is .com.mx, so the domain is cableonline.com.mx
* IANA List: TLD is .mx, so the domain is com.mx
This works as expected.
What you are asking for is that the tool take the longest one on both lists, which is a new feature. Thanks for the idea 🙂
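To make the difference between the two lists concrete, here is a minimal Python sketch of longest-suffix matching. It is not URL Toolbox's actual code; the function name `split_domain` is made up for illustration, and the two sets contain only the .mx entries quoted in this thread. The real Mozilla list also has wildcard and exception rules, which this sketch ignores.

```python
# Illustrative suffix lists: only the .mx entries quoted in this thread.
MOZILLA = {"mx", "com.mx", "org.mx", "gob.mx", "edu.mx", "net.mx", "blogspot.mx"}
IANA = {"mx"}


def split_domain(fqdn, suffixes):
    """Return (domain, suffix) using the longest suffix found in `suffixes`.

    Hypothetical helper, not part of URL Toolbox.
    """
    labels = fqdn.lower().split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that com.mx is found before mx.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                # The name is itself a suffix; there is no domain part.
                return None, candidate
            # The domain is the suffix plus one more label to its left.
            return ".".join(labels[i - 1:]), candidate
    return None, None


if __name__ == "__main__":
    for name in ("customer-gdl-168-245.megared.net.mx", "ale.nubehost.mx"):
        print(name, split_domain(name, MOZILLA), split_domain(name, IANA))
```

With the Mozilla entries the first name splits on net.mx, with the IANA entry it splits on mx, which is the difference described in the two bullets above.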
Happy to help make the tool better 🙂
I figured out how to do what I needed sans the app, using a lookup table and a creative Splunk search pipeline. Waiting for it to be approved by a moderator...
I got TLD lookups working, but sans URL Toolbox.
Here's what I did, in case anyone else is interested:
1) Construct a lookup table from a suffix list of your choice.
I used the Mozilla list. Looking it over, I saw that the longest TLD had four elements. I had to construct the lookup table in the correct order, so that, for example, FQDNs in com.mx would match that TLD before the mx TLD. I also noticed that a few TLDs themselves had MX records, and so could conceivably show up in logs without any other prepended element.
Thus, the strategy of adding entries in the order of:
*.four.three.two.one
four.three.two.one
*.three.two.one
three.two.one
*.two.one
two.one
*.one
one
I also grepped out the 8bit stuff, at least in this iteration.
This is the script I came up with to generate the lookup table:
#!/bin/bash
# Build a Splunk lookup table (rev,levels) from a public suffix list.
TLDFILE=suffix_list_mozilla.dat

# Print the lookup rows for every suffix matching the given pattern.
# $1 = number of elements in the suffix, $2 = anchored regex for that depth.
# The bare suffix gets levels=0; the wildcard form gets levels=$1.
emit_level() {
    local levels=$1 pattern=$2
    # Drop lines containing 8-bit (non-ASCII) characters, then select by depth.
    grep -v -P "[^\11\12\40-\176]" "$TLDFILE" | grep -E "$pattern" |
    while read -r domain
    do
        echo "$domain,0"
        echo "*.$domain,$levels"
    done
}

echo rev,levels
# Longest suffixes first, so that they match before the shorter ones.
emit_level 4 '^[a-z-]*\.[a-z-]*\.[a-z-]*\.[a-z-]*$'
emit_level 3 '^[a-z-]*\.[a-z-]*\.[a-z-]*$'
emit_level 2 '^[a-z-]*\.[a-z-]*$'
emit_level 1 '^[a-z-]{2,}$'
2) Add the lookup table to Splunk via the GUI.
Upload the lookup table file produced by the previous step.
(I saved mine as suffix_list.csv)
Add a lookup definition, making sure to set default match to 0, max matches to 1, and min matches to 0.
(I added it as tld_suffix)
3) Enable wildcards for the first field. Edit transforms.conf, adding the WILDCARD line as in the example below:
[tld_suffix]
default_match = 0
filename = suffix_list.csv
max_matches = 1
min_matches = 0
match_type = WILDCARD(rev)
4) Restart splunkd
So it sees the updated transforms.conf
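Why the row order matters can be sketched outside Splunk. The Python below is only an emulation under two assumptions: that with max_matches = 1 the lookup returns the first matching row in file order, and that `*` is the only wildcard in play. The rows are a hand-written slice of what the script would emit for the .mx suffixes, and `lookup_levels` is a made-up name.

```python
from fnmatch import fnmatchcase

# Hand-written slice of the lookup table, in the order the script emits it:
# two-element suffixes before the one-element suffix.
ROWS = [
    ("com.mx", 0), ("*.com.mx", 2),
    ("net.mx", 0), ("*.net.mx", 2),
    ("mx", 0), ("*.mx", 1),
]

DEFAULT_LEVELS = 0  # stands in for default_match = 0


def lookup_levels(fqdn, rows=ROWS):
    """Return `levels` from the first row whose pattern matches (max_matches = 1)."""
    for pattern, levels in rows:
        if fnmatchcase(fqdn, pattern):
            return levels
    return DEFAULT_LEVELS


if __name__ == "__main__":
    print(lookup_levels("customer-gdl-168-245.megared.net.mx"))
    print(lookup_levels("ale.nubehost.mx"))
    # With the short suffix listed first, the net.mx name would match *.mx instead.
    print(lookup_levels("customer-gdl-168-245.megared.net.mx", list(reversed(ROWS))))
```

If the `*.mx` row came first, every .mx name would come back with levels 1, and the com.mx and net.mx rows would never be reached.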
5) Run the search.
I added a couple rex sed commands to take care of some anomalies I was seeing in rDNS (*.somefqdn.com, and something_other.otherfqdn.com). Since both are illegal as part of a domain name (latter legit only for SRV records, I think), this seemed safe.
The 'eval dn=' tests in the search are ordered by how often each case shows up in logs here: 2-element TLDs are the most common, then 1-element, then 3, then 4.
If there is no match found in the lookup table, the search preserves the original FQDN. I may change this later for my own purposes here, ymmv.
The bits in the middle are what do the work, the rest is wrapper to illustrate an example search.
index = our_index rev="*"
| fields rev
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
| eval fqdn = lower(rev)
| rex field=fqdn mode=sed "s/\*\.//g"
| rex field=fqdn mode=sed "s/_/-/g"
| lookup tld_suffix rev AS fqdn
| eval dn=if(levels == 2, replace(fqdn, "^([a-z0-9-\.]+\.)([a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+$)", "\2"), if(levels == 1, replace(fqdn, "^([a-z0-9-\.]+\.)([a-z0-9-]+\.[a-z0-9-]+$)", "\2"), if(levels == 3, replace(fqdn, "^([a-z0-9-\.]+\.)([a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+$)", "\2"), if(levels == 4, replace(fqdn, "^([a-z0-9-\.]+\.)([a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+$)", "\2"), fqdn))))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| stats count by levels,fqdn,dn
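The nested replace() calls all do the same thing: keep the TLD (levels elements) plus one more element to its left. A rough Python equivalent, as a sketch only, with the made-up name `reduce_fqdn` and with levels 0 treated as "keep the FQDN", which is the behaviour described above for names with no match:

```python
def reduce_fqdn(fqdn, levels):
    """Keep the last `levels` + 1 dot-separated elements of `fqdn`.

    Hypothetical helper mirroring the intent of the 'eval dn=' step.
    """
    labels = fqdn.split(".")
    # levels == 0 means no wildcard row matched (or the name is a bare TLD):
    # leave the name as it is.
    if levels < 1 or len(labels) <= levels:
        return fqdn
    return ".".join(labels[-(levels + 1):])


if __name__ == "__main__":
    print(reduce_fqdn("customer-gdl-168-245.megared.net.mx", 2))
    print(reduce_fqdn("ale.nubehost.mx", 1))
    print(reduce_fqdn("host.example.invalid", 0))
```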
6) Enjoy your new "reduce FQDN to domain" tool 🙂 Adjust
This does exactly what I wanted. I thought URL Toolbox would too, but maybe something odd in the recent version, or Python, or some other reason prevented it from doing that.
As described by the Mozilla Suffix List, .net.mx is a TLD, as are others like .com.mx and .org.mx.
If you only want to parse the first part, set the list to iana or to custom if you have your own custom list.
Yes, but... shouldn't URL Toolbox be smart enough to know that, when using the Mozilla list, an fqdn of foo.company.com.mx gives a ut_domain of company.com.mx, and an fqdn of ale.nubehost.mx gives a ut_domain of nubehost.mx?
Here's another example, using iana as you suggest.
index = our_index rev="*mx" | fields rev | eval lrev = lower(rev) | eval rev = lrev | dedup rev | eval list="iana" | lookup ut_parse_extended_lookup url AS rev
rev is xxx.xxx.xxx.xx.cable.dyn.cableonline.com.mx
Shouldn't ut_domain be cableonline.com.mx? Instead it's com.mx again.
lrev = xxx.xxx.xxx.xx.cable.dyn.cableonline.com.mx
rev = xxx.xxx.xxx.xx.cable.dyn.cableonline.com.mx
ut_domain = com.mx
ut_domain_without_tld = com
ut_fragment = None
ut_netloc = xxx.xxx.xxx.xx.cable.dyn.cableonline.com.mx
ut_params = None
ut_path = None
ut_port = 80
ut_query = None
ut_scheme = None
ut_subdomain = xxx.xxx.xxx.xx8.cable.dyn.cableonline
ut_subdomain_count = 7
ut_subdomain_parts = {"ut_subdomain_level_4": "58", "ut_subdomain_level_5": "106", "ut_subdomain_level_6": "239", "ut_subdomain_level_7": "177", "ut_subdomain_level_1": "cableonline", "ut_subdomain_level_2": "dyn", "ut_subdomain_level_3": "cable"}
ut_tld = mx
_time = 2016-04-13T19:25:19.294+00:00
From the Mozilla list:
mx
com.mx
org.mx
gob.mx
edu.mx
net.mx
blogspot.mx
From the iana list:
MX
My apologies if I'm misunderstanding something.