I have some real DNS data from my IDS, which I can retrieve with the following search:
index=ids sourcetype=suricata event_type=dns | table _time src_ip domain
I have read the Operationalize Machine Learning part of the DGA App for Splunk.
Setup notes:
1. Create an index that holds domain names and computed features (we used an index named "dga_proxy").
2. Activate the scheduled searches (app menu: More > Alerts) to generate sample data and fill this index.
3. Check the macro domain_input in Settings > Advanced Search if you use custom naming.
Following the instructions above, I did the following:
1. Created an index named dga_prod.
2. Created a scheduled search (alert) with the following SPL:
index=ids event_type=dns
| stats latest(_time) as _time, values(src_ip) as src_ip, values(dest_ip) as dest_ip, values(dns.answer{}.rrtype) as type, values(dns.type) as dns_type, values(asset_name) as asset_name, count by domain
| `ut_shannon(domain)`
| `ut_meaning(domain)`
| eval ut_digit_ratio = 0.0
| eval ut_vowel_ratio = 0.0
| eval ut_domain_length = max(1, len(domain))
| rex field=domain max_match=0 "(?<digits>\d)"
| rex field=domain max_match=0 "(?<vowels>[aeiou])"
| eval ut_digit_ratio = if(isnull(digits), 0.0, mvcount(digits) / ut_domain_length)
| eval ut_vowel_ratio = if(isnull(vowels), 0.0, mvcount(vowels) / ut_domain_length)
| eval ut_consonant_ratio = max(0.0, 1.0 - ut_digit_ratio - ut_vowel_ratio)
| eval ut_vc_ratio = if(ut_consonant_ratio == 0, 0.0, ut_vowel_ratio / ut_consonant_ratio)
| apply "dga_ngram"
| apply "dga_pca"
| apply "dga_randomforest" as class
| fields - digits - vowels - domain_tfidf*
| collect index=dga_prod
This alert, like dga_eventgen, runs every minute to fill the dga_prod index.
3. Edited the domain_input macro, changing the default index=dga_proxy to index=dga_prod.
I have some questions:
1. Am I doing this correctly?
2. How do I deal with false positives? I see that some perfectly normal domain names are also detected as DGA, for example my company's domain brower.360.cn, as well as www.xmind.cn, http.kali.org, etc. Do I need to add them to a whitelist, and if so, how?
3. I can't find much more material on the DGA App for Splunk (documents, videos, manuals, etc.), and I have also only just learned to use MLTK.
Hi @bestSplunker ,
first of all, good news for your very first question: basically you're doing everything correctly! 🙂
In your summary index (index=dga_prod) you have collected the results, which contain the dga/legit classification for each domain. To reduce your false positives, you can consider four options:
1. You can add new domains to the existing training data so that your classifier model is trained on more information.
2. You can use the prepared KVStore (| inputlookup dga_known_domains) and manually readjust the dga/legit labels with the help of the button mechanism on the "4. Operationalize" dashboard. With this mechanism you can build up additional data that you can merge with the existing training data (as in points 1 and 4) to retrain the classification model and improve it over time.
3. You should definitely consider curating a whitelist, e.g. in a lookup, to exclude known legitimate domain names from the classifier by definition.
4. Last but not least: which dataset did you train on? The DGA App for Splunk also ships with a 1.7M-entry dataset (| inputlookup dga_test) from which you can calculate features and build the model.
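For the whitelist idea, a minimal sketch of a lookup-based filter in your scheduled search could look like this; the lookup file name domain_whitelist.csv and its single domain column are assumptions for illustration, not something the app ships with:

```spl
index=ids event_type=dns
| stats count by domain
| lookup domain_whitelist.csv domain OUTPUT domain as whitelisted
| where isnull(whitelisted)
```

Any domain found in the lookup is dropped before the feature calculations and apply commands run, so whitelisted domains can never be classified as dga.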
For a basic walkthrough I assume you already checked the YouTube video: https://www.youtube.com/watch?v=1ctPStvI3BY
I hope those options make sense and help you make further improvements.
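To make the retraining idea more concrete: a rough sketch of rebuilding the model on the shipped dataset could look like the following, assuming the label field in dga_test is called class and that you compute the same ut_* features as in your scheduled search; treat this as a sketch under those assumptions, not the app's exact search:

```spl
| inputlookup dga_test
| `ut_shannon(domain)`
| `ut_meaning(domain)`
| fit RandomForestClassifier class from ut_* into "dga_randomforest"
```

After the fit, your scheduled search's apply "dga_randomforest" picks up the retrained model automatically.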
@pdrieger_splunk
thank you very much for your reply. Regarding the second method you mentioned: do I need to modify the SPL of dga_feedback_kvstore, changing index=dga_proxy to `domain_input`, so that I can manually adjust false-positive DGA domains to legit? And regarding the fourth point, I don't quite understand how to do it; forgive me, I have only just learned to use MLTK.