I got TLD lookups working, but sans URL Toolbox.
Here's what I did, in case anyone else is interested:
1) Construct a lookup table from a suffix list of your choice.
I used the Mozilla list. Looking it over, I saw that the longest TLD entries had four elements. I had to construct the lookup table in the correct order so that, for example, FQDNs in com.mx would match that TLD before the mx TLD. I also noticed that a few TLDs themselves have MX records, and so could conceivably show up in logs without any other prepended element.
Thus the strategy of adding entries in this order:
*.four.three.two.one
four.three.two.one
*.three.two.one
three.two.one
*.two.one
two.one
*.one
one
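To illustrate why the ordering matters, here is a minimal sketch that mimics a first-match-wins wildcard lookup in bash. The four-row table and the `first_match` helper are made up for the example and are not part of Splunk; the sketch only assumes that the first matching row in file order is the one returned when max matches is 1.

```shell
#!/bin/bash
# Hypothetical mini lookup table in the order described above
# (longer suffixes first). Format: pattern,levels
TABLE='*.com.mx,2
com.mx,0
*.mx,1
mx,0'

# first_match is an illustrative helper, not a Splunk command:
# print the levels value of the first pattern that matches the FQDN.
first_match() {
  local fqdn=$1 pattern levels
  while IFS=, read -r pattern levels; do
    # An unquoted $pattern on the right-hand side of == is matched as a glob.
    if [[ $fqdn == $pattern ]]; then
      echo "$levels"
      return 0
    fi
  done <<< "$TABLE"
  return 1
}

first_match mail.example.com.mx   # prints 2
first_match example.mx            # prints 1
first_match mx                    # prints 0
```

If the `*.mx` row came before `*.com.mx`, the first call would print 1 instead, and the name would later be cut down to com.mx rather than example.com.mx.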
I also grepped out the 8-bit (non-ASCII) entries, at least in this iteration.
This is the script I came up with to generate the lookup table:
#!/bin/bash
# Generate a Splunk lookup table (rev,levels) from a public suffix list.
# Lines containing bytes outside printable ASCII are dropped first.
TLDFILE=suffix_list_mozilla.dat
echo "rev,levels"
# get 4-level tlds
for DOMAIN in $(grep -v -P "[^\11\12\40-\176]" "$TLDFILE" | grep -E '^[a-z-]*\.[a-z-]*\.[a-z-]*\.[a-z-]*$')
do
echo "$DOMAIN,0"
echo "*.$DOMAIN,4"
done
# get 3-level tlds
for DOMAIN in $(grep -v -P "[^\11\12\40-\176]" "$TLDFILE" | grep -E '^[a-z-]*\.[a-z-]*\.[a-z-]*$')
do
echo "$DOMAIN,0"
echo "*.$DOMAIN,3"
done
# get 2-level tlds
for DOMAIN in $(grep -v -P "[^\11\12\40-\176]" "$TLDFILE" | grep -E '^[a-z-]*\.[a-z-]*$')
do
echo "$DOMAIN,0"
echo "*.$DOMAIN,2"
done
# get 1-level tlds
for DOMAIN in $(grep -v -P "[^\11\12\40-\176]" "$TLDFILE" | grep -E '^[a-z-]{2,}$')
do
echo "$DOMAIN,0"
echo "*.$DOMAIN,1"
done
2) Add the lookup table to Splunk via the GUI.
Upload the lookup table file produced by the previous step.
(I saved mine as suffix_list.csv)
Add a lookup definition, making sure to set default match to 0, max matches to 1, and min matches to 0.
(I added it as tld_suffix)
3) Enable wildcards for the first field. Edit transforms.conf, adding the WILDCARD line as in the example below:
[tld_suffix]
default_match = 0
filename = suffix_list.csv
max_matches = 1
min_matches = 0
match_type = WILDCARD(rev)
4) Restart splunkd so that it picks up the updated transforms.conf.
5) Run the search.
I added a couple of rex sed commands to take care of some anomalies I was seeing in rDNS (*.somefqdn.com, and something_other.otherfqdn.com). Since neither the asterisk nor the underscore is legal in a hostname (the latter is legit only for SRV records, I think), this seemed safe.
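For anyone who wants to try the same cleanup outside Splunk, here is a rough bash equivalent of the lower() call and the two rex sed steps used in the search. The `normalize_fqdn` name is invented for this sketch.

```shell
#!/bin/bash
# normalize_fqdn is an illustrative helper mirroring the search's cleanup:
# lowercase the name, strip any "*." sequences, turn underscores into hyphens.
normalize_fqdn() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed -e 's/\*\.//g' -e 's/_/-/g'
}

normalize_fqdn '*.SomeFQDN.com'                  # prints somefqdn.com
normalize_fqdn 'something_other.otherfqdn.com'   # prints something-other.otherfqdn.com
```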
Since 2-element TLDs are the most common in logs here, followed by 1-element, then 3-element and then 4-element ones, the 'eval dn=' branches in the search are tested in that order.
If no match is found in the lookup table, the search preserves the original FQDN. I may change this later for my own purposes here, ymmv.
The lines between the vvv and ^^^ markers are what do the work; the rest is a wrapper to illustrate an example search.
index = our_index rev="*"
| fields rev
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
| eval fqdn = lower(rev)
| rex field=fqdn mode=sed "s/\*\.//g"
| rex field=fqdn mode=sed "s/_/-/g"
| lookup tld_suffix rev AS fqdn
| eval dn=if(levels == 2, replace(fqdn, "^([a-z0-9-\.]+\.)([a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+$)", "\2"),
    if(levels == 1, replace(fqdn, "^([a-z0-9-\.]+\.)([a-z0-9-]+\.[a-z0-9-]+$)", "\2"),
    if(levels == 3, replace(fqdn, "^([a-z0-9-\.]+\.)([a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+$)", "\2"),
    if(levels == 4, replace(fqdn, "^([a-z0-9-\.]+\.)([a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+\.[a-z0-9-]+$)", "\2"),
    fqdn))))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| stats count by levels,fqdn,dn
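As a cross-check outside Splunk, the reduction the eval aims for (keep the matched suffix plus one more label) can be sketched in bash. `reduce_fqdn` is a made-up helper, and it assumes that names with levels 0 or no match are passed through unchanged, as described above.

```shell
#!/bin/bash
# reduce_fqdn is an illustrative helper, not a Splunk command.
# Given an FQDN and the levels value from the lookup, keep the last
# levels+1 labels; with levels 0 (or no value) return the FQDN unchanged.
reduce_fqdn() {
  local fqdn=$1 levels=${2:-0}
  if (( levels == 0 )); then
    printf '%s\n' "$fqdn"
    return 0
  fi
  printf '%s\n' "$fqdn" | awk -F. -v keep="$((levels + 1))" '{
    start = NF - keep + 1          # index of the first label to keep
    if (start < 1) start = 1       # name is already short enough
    out = $start
    for (i = start + 1; i <= NF; i++) out = out "." $i
    print out
  }'
}

reduce_fqdn mail.example.com.mx 2   # prints example.com.mx
reduce_fqdn www.example.com 1       # prints example.com
reduce_fqdn example.com 1           # prints example.com
reduce_fqdn mx 0                    # prints mx
```

Counting labels instead of using one regex per level keeps the sketch short; the search itself uses explicit regexes, ordered by how often each level shows up in the logs.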
6) Enjoy your new "reduce FQDN to domain" tool 🙂 Adjust
This does exactly what I wanted. I thought URL Toolbox would too, but maybe something odd in the recent version, or in Python, or some other reason prevented it from doing that.