Solved: How do I look for domains that aren't in the Alexa...

MonkeyK · ‎08-16-2017

I am trying to find non-alexa top 1 million domain requests.

I am getting alexa_by_str.csv from https://s3.amazonaws.com/alexa-static/top-1m.csv.zip.
The file contains domain and rank like
domain, rank
google.com,1
youtube.com,2
facebook.com,3
...

well, clearly a lookup on this would not take subdomains into account. So
| lookup alexa_by_str.csv OUTPUT rank
does not match on www.google.com

So I tried modifying the lookup with a wildcard
|inputlookup alexa_by_str.csv |eval domain="*.".domain | outputlookup |inputlookup wc_alexa_by_str.csv
(and created a lookup defn with WILDCARDS(domains))

but then searches result in Error
"Error using lookup table 'wc_alexa_lookup_by_str': CIDR and wildcard matching is restricted to lookup files under the in-memory size limit."

So I can increase the "max_memtable_bytes" in limits.conf so something huge (200M?) but I have to wonder if this is the right approach. Since Splunk ES downloads alexa top 1million by default, maybe it has some other way of dealing with subdomains.

Is there a better way of matching alexa top1M domains (or rejecting them) than to rewrite alexa_by_str.csv with wildcards and then increase max_memtable_bytes?

MonkeyK · ‎09-27-2017

For the benefit of answering this question, I will point out that between DalJeanis and I, we came up with the following query:

 index=pan_logs earliest=-1s log_subtype=url 
 | rex field=dest_hostname "(?P<ld2>[\w_-]+\.[\w_-]+)$" 
 | rex field=dest_hostname "(?P<ld3>[\w_-]+\.[\w_-]+\.[\w_-]+)$" 
 | rex field=dest_hostname "(?P<ld4>[\w_-]+\.[\w_-]+\.[\w_-]+\.[\w_-]+)$" 
 | `alexa_lookup(ld2)` 
 | `alexa_lookup(ld3)` 
 | `alexa_lookup(ld4)`  
 | eval rank=coalesce(ld4_rank, ld3_rank, ld2_rank, "unknown") 
 | eval ld=case(ld4_rank>0,"4",ld3_rank>0,"3",ld2_rank>0,"2",1=1,"NA") 
 |  table dest_hostname rank ld

DalJeanis also noted that the inverselookup (not present in the Alexa top 1M) can be done as follows:

  index=pan_logs earliest=-1s log_subtype=url 
  | rex field=dest_hostname "(?P<ld2>[\w_-]+\.[\w_-]+)$" 
  | `alexa_lookup(ld2)` 
  | where isnull(ld2_rank) OR ld2_rank>1000000

  | rex field=dest_hostname "(?P<ld3>[\w_-]+\.[\w_-]+\.[\w_-]+)$" 
  | `alexa_lookup(ld3)` 
  | where isnull(ld3_rank) OR coalesce(ld3_rank,ld2_rank)>1000000

  | rex field=dest_hostname "(?P<ld4>[\w_-]+\.[\w_-]+\.[\w_-]+\.[\w_-]+)$" 
  | `alexa_lookup(ld4)`  
  | where isnull(ld4_rank) OR coalesce(ld4_rank,ld3_rank,ld2_rank)>1000000

  | eval rank=coalesce(ld4_rank, ld3_rank, ld2_rank, "unknown") 
  | eval ld=case(ld4_rank>0,"4",ld3_rank>0,"3",ld2_rank>0,"2",1=1,"NA") 
  |  table dest_hostname rank ld

View solution in original post

MonkeyK · ‎09-27-2017

For the benefit of answering this question, I will point out that between DalJeanis and I, we came up with the following query:

 index=pan_logs earliest=-1s log_subtype=url 
 | rex field=dest_hostname "(?P<ld2>[\w_-]+\.[\w_-]+)$" 
 | rex field=dest_hostname "(?P<ld3>[\w_-]+\.[\w_-]+\.[\w_-]+)$" 
 | rex field=dest_hostname "(?P<ld4>[\w_-]+\.[\w_-]+\.[\w_-]+\.[\w_-]+)$" 
 | `alexa_lookup(ld2)` 
 | `alexa_lookup(ld3)` 
 | `alexa_lookup(ld4)`  
 | eval rank=coalesce(ld4_rank, ld3_rank, ld2_rank, "unknown") 
 | eval ld=case(ld4_rank>0,"4",ld3_rank>0,"3",ld2_rank>0,"2",1=1,"NA") 
 |  table dest_hostname rank ld

DalJeanis also noted that the inverselookup (not present in the Alexa top 1M) can be done as follows:

  index=pan_logs earliest=-1s log_subtype=url 
  | rex field=dest_hostname "(?P<ld2>[\w_-]+\.[\w_-]+)$" 
  | `alexa_lookup(ld2)` 
  | where isnull(ld2_rank) OR ld2_rank>1000000

  | rex field=dest_hostname "(?P<ld3>[\w_-]+\.[\w_-]+\.[\w_-]+)$" 
  | `alexa_lookup(ld3)` 
  | where isnull(ld3_rank) OR coalesce(ld3_rank,ld2_rank)>1000000

  | rex field=dest_hostname "(?P<ld4>[\w_-]+\.[\w_-]+\.[\w_-]+\.[\w_-]+)$" 
  | `alexa_lookup(ld4)`  
  | where isnull(ld4_rank) OR coalesce(ld4_rank,ld3_rank,ld2_rank)>1000000

  | eval rank=coalesce(ld4_rank, ld3_rank, ld2_rank, "unknown") 
  | eval ld=case(ld4_rank>0,"4",ld3_rank>0,"3",ld2_rank>0,"2",1=1,"NA") 
  |  table dest_hostname rank ld

dmenon84 · ‎05-16-2018

Hi - Can you please give some guidance on how you created the macros? Thanks in advance

DalJeanis · ‎08-16-2017

Why don't you just cut down the event domain field you are using to look it up?

MonkeyK · ‎08-16-2017

by how much? domains can have up to 127 levels (https://en.wikipedia.org/wiki/Domain_Name_System#Domain_name_formulation)
for example all of the below are real domains
state.oh.us
sos.state.oh.us
www.sos.state.oh.us
www.dot.state.oh.us

I think that it is difficult to cut it by a variable amount, especially when I don't know how deep subdomains in the list may go.
I don't have the download in front of me, but if I go here: http://www.alexa.com/topsites
I can see domains listed that are 2 and 3 deep; so assuming that everything is just a TLD with a second level domain seems unreasonable.

DalJeanis · ‎08-18-2017

They are real domains, but how does alexa file them? And what do you mean when you say domains that aren't in the top million? By my question, I mean, if a subdomain $4.$3.$2.$1 is not in the list, but $2.$1 is in the list, does that count as top million or not?

In your example, I would think that state.oh.us would be what you check.

Mostly you want these

   xxx.yyy
   xxx.co.yyy
   xxx.yyy.edu
   state.xxx.us

so that simplifies to {ignore.}$3.$2.$1:

if $2="co" OR $1 = "edu" OR $1="us" then use 3 nodes otherwise use 2 nodes.

You might even eliminate all the .us ones anyway, because whether or not they are in the top million, your business use case probably doesn't care...

MonkeyK · ‎08-21-2017

figured I get the depth counts:
|inputlookup alexa_by_str.csv
| makemv delim="." domain
| eval depth=mvcount(domain)
| stats count by depth

depth count
2 865923
3 133808
4 2865

MonkeyK · ‎08-21-2017

Also, to prove that my way of thinking about the alexa top 1M is correct, we can look for some common domains that should be included:

google.com
|inputlookup alexa_by_str.csv | search domain="google"

all results start with "google", no calendar.google.com, maps.google.com, mail.google.com: all domains that should clearly be in the top 1M.

I conclude that Alexa intends for us to match on subdomains.

DalJeanis · ‎08-21-2017

So that simplifies to this -

whatever your URL you want to check, reduce it to the last 3 nodes before the slashes

{ignore.}$3.$2.$1.//etc :

if $2="co" OR $1 = "edu" OR $1="us" then use 3 nodes for your lookup $3.$2.$1
otherwise use 2 nodes for your lookup $2.$1.

MonkeyK · ‎08-21-2017

hmmm... good point. Since alexa only has 2,3, and 4 levels deep, I guess that I just need to compare possible second, third, and fourth level domains.

Fortunately, Splunk's alexa macro makes this easier

index=pan_logs earliest=-1s log_subtype=url 
| rex field=dest_hostname "(?P<ld2>[\w_-]+\.[\w_-]+)$" 
| rex field=dest_hostname "(?P<ld3>[\w_-]+\.[\w_-]+\.[\w_-]+)$" 
| rex field=dest_hostname "(?P<ld4>[\w_-]+\.[\w_-]+\.[\w_-]+\.[\w_-]+)$" 
| `alexa_lookup(ld2)` 
| `alexa_lookup(ld3)` 
| `alexa_lookup(ld4)`  
| eval rank=coalesce(ld4_rank, ld3_rank, ld2_rank, "unknown") 
| eval ld=case(ld4_rank>0,"4",ld3_rank>0,"3",ld2_rank>0,"2",1=1,"NA") 
|  table dest_hostname rank ld

I guess that I should wrap the Alexa macro with my own.

Also 3 extractions and 3 lookup maches seems super inefficient. Too bad Splunk doesn't handle this directly

DalJeanis · ‎08-22-2017

@MonkeyK - Well, if it comes to enhancement requests, this ones not high on my list. I'd much rather they be spending their development hours on a couple of other things...

If you are just looking for items that are NOT in the top million, then you can make it more efficient by the order... especially since the most common URLs in Alexa will also tend to be the most common in your events and possibly the shortest as well...

 index=pan_logs earliest=-1s log_subtype=url 
 | rex field=dest_hostname "(?P<ld2>[\w_-]+\.[\w_-]+)$" 
 | `alexa_lookup(ld2)` 
 | where isnull(ld2_rank) OR ld2_rank>1000000

 | rex field=dest_hostname "(?P<ld3>[\w_-]+\.[\w_-]+\.[\w_-]+)$" 
 | `alexa_lookup(ld3)` 
 | where isnull(ld3_rank) OR coalesce(ld3_rank,ld2_rank)>1000000

 | rex field=dest_hostname "(?P<ld4>[\w_-]+\.[\w_-]+\.[\w_-]+\.[\w_-]+)$" 
 | `alexa_lookup(ld4)`  
 | where isnull(ld4_rank) OR coalesce(ld4_rank,ld3_rank,ld2_rank)>1000000

 | eval rank=coalesce(ld4_rank, ld3_rank, ld2_rank, "unknown") 
 | eval ld=case(ld4_rank>0,"4",ld3_rank>0,"3",ld2_rank>0,"2",1=1,"NA") 
 |  table dest_hostname rank ld

MonkeyK · ‎08-29-2017

DalJeanis, I agree that I would not rank this near the top of my enhancement desires.

As I have researched this issue, I have lost a lot of faith in Alexa, but have replaced that with Cisco's umbrella list:
Very different sources.

Alexa –
• top 1 million second level domains
• web browsing
• ranked by total volume
• sourced from Alexa browser extension
Umbrella –
• top 1 million fully qualified domains
• DNS queries
• ranked by unique requesting IP address
• sourced from OpenDNS

Knowing this, I prefer the Umbrella list.

Interestingly, both lists may contain a significant number of malicious domains
http://www.netresec.com/?page=Blog&month=2017-04&post=Domain-Whitelist-Benchmark%3A-Alexa-vs-Umbrell...

As fully qualified domains, Umbrella includes the subdomains in their rankings:

|inputlookup umbrella_by_str.csv
| makemv delim="." domain
| eval depth=mvcount(domain)
| stats count by depth

depth count
1 1173
2 263236
3 477335
4 181147
5 58011
6 14829
7 3454
8 772
9 306
10 19
11 31
12 1

The only really odd thing about the Umbrella list is that it includes clear errors and unresolvable domains. It can be interesting to see that some of the most widely used domains are unresolvable (for example, baidu.com443 is ranked 230796 despite being a clear typo)

MonkeyK · ‎08-18-2017

you ask,

if a subdomain $4.$3.$2.$1 is not in
the list, but $2.$1does that count as
top million or not?

I would count subdomains as in the list. Basically, they are domains controlled by a (relatively) well known domain.

since alexa top 1m includes "state.oh.us"

I would like it to make it match on
sos.state.oh.us
www.sos.state.oh.us
www.dot.state.oh.us

by doing this I could reject those to focus tracking of traffic to more rare domains and start building out a list of new domains visited from my network.

mattymo · ‎08-17-2017

hmm, this seem cool. might give it a try. I think I would index the list then use some eval magic. Will let you know if I find anything useful.

- MattyMo

How do I look for domains that aren't in the Alexa top 1 million?

Announcing Scheduled Export GA for Dashboard Studio

Extending Observability Content to Splunk Cloud

More Control Over Your Monitoring Costs with Archived Metrics GA in US-AWS!