I'm trying to calculate a potential risk score from the number of consecutive consonants in a domain name (e.g. egorklwqyrjvbsxvhvcws.com is rarely a domain that people intentionally browse 🙂).
So I'm pseudo-coding for Splunk in my mind, and I'm envisioning a mess of PCRE regexes for the assessment criteria that's going to thrash our forwarders and indexers.
Is there a better way to implement the following structure?
Set (Consonant_Risk_value) = 0%
IF longest run matched by Rex(domain_name)/([bcdfghjklmnpqrstvwxyz]+)/i is 5 or 6 characters
THEN set (Consonant_Risk_value) = 40%
ELSE
IF longest run matched by Rex(domain_name)/([bcdfghjklmnpqrstvwxyz]+)/i is 7 or 8 characters
THEN set (Consonant_Risk_value) = 60%
ELSE
IF longest run matched by Rex(domain_name)/([bcdfghjklmnpqrstvwxyz]+)/i is more than 8 characters
THEN set (Consonant_Risk_value) = 80%
For something similar, check out the ut_shannon() function in the URL Toolbox app (https://splunkbase.splunk.com/app/2734/#/details).
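For the tiers you describe, a single eval with case() and match() is enough. Test the longest run first, because case() returns the value for the first condition that is true: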
... | eval Consonant_Risk_value=case(
    match(domain_name, "(?i)[bcdfghjklmnpqrstvwxyz]{9,}"), "80%",
    match(domain_name, "(?i)[bcdfghjklmnpqrstvwxyz]{7,}"), "60%",
    match(domain_name, "(?i)[bcdfghjklmnpqrstvwxyz]{5,}"), "40%",
    true(), "0%")
Note that match() takes a bare regex string, so case-insensitivity goes inline as (?i) rather than as a trailing /i, and because the tiers are checked from longest to shortest, {7,} only ever fires for runs of 7 or 8.
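If you want to sanity-check the thresholds outside Splunk before rolling this out, here is a minimal Python sketch of the same tiering. It is only a prototype under my own assumptions: the function names longest_consonant_run and consonant_risk are made up for illustration, and it measures the longest run once instead of testing three separate patterns.

```python
import re

# Same consonant class as in the search; "y" is counted as a consonant.
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxyz]+", re.IGNORECASE)


def longest_consonant_run(domain_name: str) -> int:
    """Return the length of the longest run of consecutive consonants."""
    runs = CONSONANT_RUN.findall(domain_name)
    return max((len(run) for run in runs), default=0)


def consonant_risk(domain_name: str) -> str:
    """Map the longest consonant run to the tiers from the question."""
    longest = longest_consonant_run(domain_name)
    if longest >= 9:
        return "80%"
    if longest >= 7:
        return "60%"
    if longest >= 5:
        return "40%"
    return "0%"


if __name__ == "__main__":
    for name in ("egorklwqyrjvbsxvhvcws.com", "strengths.com", "splunk.com"):
        print(name, longest_consonant_run(name), consonant_risk(name))
```

Dots and digits break a run, so the TLD never joins onto the hostname. Ordinary English words such as "strengths" already reach the 40% tier, which is worth keeping in mind when you weight this score.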
P.S. Have you heard about Shannon Entropy?
https://www.splunk.com/blog/2016/04/21/when-entropy-meets-shannon/
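In case the idea is new, here is a rough Python sketch of character-level Shannon entropy, so you can see why random-looking names score higher. This is the textbook formula written under my own assumptions (the function name shannon_entropy is just illustrative); I am not claiming it matches what ut_shannon() computes digit for digit.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character, from character frequencies."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over each distinct character's probability p.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    for label in ("google", "egorklwqyrjvbsxvhvcws"):
        print(label, round(shannon_entropy(label), 3))
```

A label with few repeated characters spreads its probability mass evenly and lands near the maximum for its length, while a short, repetitive name scores low. Entropy also grows with the number of distinct characters, so it works better alongside the consonant-run score than as a replacement for it.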