
How to apply real data to dga app for splunk for training


I have some real DNS data obtained from an IDS; I can retrieve it with the following search:
index = ids sourcetype=suricata event_type=dns | table _time src_ip domain

I have read the Operationalize Machine Learning part of the DGA App for Splunk.

Setup notes:

1. Create an index that holds domain names and computed features (we used an index named "dga_proxy")
2. Activate scheduled searches (app menu: More > Alerts) to generate sample data and fill this index.
3. Check the macro `domain_input` in Settings > Advanced Search if you use custom naming.

Following the instructions above, I did the following:

1. Create an index named dga_prod.
2. Create a scheduled search alert; the SPL is as follows:

 index=ids event_type=dns 
| stats latest(_time) as _time, values(src_ip) as src_ip, values(dest_ip) as dest_ip, values(dns.answer{}.rrtype) as type, values(dns.type) as dns_type, values(asset_name) as asset_name, count by domain
| `ut_shannon(domain)`
| `ut_meaning(domain)`
| eval ut_digit_ratio = 0.0 
| eval ut_vowel_ratio = 0.0 
| eval ut_domain_length = max(1, len(domain)) 
| rex field=domain max_match=0 "(?<digits>\d)" 
| rex field=domain max_match=0 "(?<vowels>[aeiou])" 
| eval ut_digit_ratio = if(isnull(digits), 0.0, mvcount(digits) / ut_domain_length) 
| eval ut_vowel_ratio = if(isnull(vowels), 0.0, mvcount(vowels) / ut_domain_length) 
| eval ut_consonant_ratio = max(0.0, 1.0 - ut_digit_ratio - ut_vowel_ratio) 
| eval ut_vc_ratio = ut_vowel_ratio / ut_consonant_ratio 
| apply "dga_ngram" 
| apply "dga_pca"
| apply "dga_randomforest" as class
| fields - digits, vowels, domain_tfidf*
| collect index=dga_prod

This alert, like dga_eventgen, runs every minute to fill the dga_prod index.

3. Edit the `domain_input` macro, changing the default `index = dga_proxy` to `index = dga_prod`.
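For reference, that macro edit can be made in Settings > Advanced Search > Search macros, or directly in macros.conf. A sketch of the resulting stanza (the exact default definition may differ in your app version):

 # macros.conf (sketch; stanza contents may vary by app version)
 [domain_input]
 definition = index=dga_prod

Any saved search in the app that calls `domain_input` will then read from dga_prod instead of the default index.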

I have some questions:

1. Am I doing this correctly? Regarding false positives: I see that some perfectly normal domain names are also classified as dga, for example my company's domain name. Do I need to add them to a whitelist, and how do I do that?
2. I can't find many related DGA App for Splunk documents, videos, manuals, etc. I have also only just learned to use MLTK.


Re: How to apply real data to dga app for splunk for training

Splunk Employee

Hi @bestSplunker ,

First of all, good news for your very first question: basically you're doing everything correctly! 🙂

In your summary index dga_prod you have collected the results, which contain the classification result dga/legit for each domain. To improve on your false positives, you can consider 4 options:
1. You can add new domains to the existing training data to train your classifier model on more information and improve it.
2. You can use the prepared KVStore (| inputlookup dga_knowndomains) and manually readjust the dga/legit label with the help of the button mechanism on the "4. Operationalize" dashboard. With this mechanism you can build up additional data that you can merge with the existing training data (as in points 1 and 4) to retrain the classification model and improve it over time.
3. You should definitely consider curating a whitelist, e.g. in a lookup, to exclude known (legit) domain names by definition from running the classifier on.
4. Last but not least, which dataset did you train on? The DGA App for Splunk also ships with a 1.7M dataset (| inputlookup dga_test) from which you can calculate features and build the model.
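For option 3, one way to apply such a whitelist is to filter those domains out before the feature/apply pipeline runs. A sketch, assuming a hypothetical CSV lookup named domain_whitelist with a single domain column that you would create and populate yourself:

 index=ids event_type=dns
 | search NOT [| inputlookup domain_whitelist | fields domain]
 ... (continue with the stats/feature/apply pipeline from your alert)

The subsearch expands into a NOT (domain="..." OR domain="...") filter, so whitelisted domains never reach the classifier. Alternatively you can filter after classification with something like | lookup domain_whitelist domain OUTPUTNEW domain as whitelisted | where isnull(whitelisted), which keeps only domains that have no match in the lookup.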

For a basic walkthrough, I assume you have already checked the YouTube video.

I hope these options make sense and are helpful for your further improvements.


Re: How to apply real data to dga app for splunk for training


Thank you very much for your reply. Regarding the second method you mentioned: do I need to modify the SPL of `dga_feedback_kvstore`, changing `index = dga_proxy` to `domain_input`, so that I can manually adjust the false-positive dga domains to legit? And as for the fourth point, I don't quite understand how to do it. Forgive me, I have only just learned to use MLTK.
