how to apply these trained data models to actual DNS data

thambisetty
SplunkTrust

Hi,

First of all, thanks for the app and the YouTube video. I got the results. Can I directly use any one of the trained models on my DNS data? If yes, how can I apply it?

Thanks in advance.

pdrieger_splunk
Splunk Employee

How to apply the DGA App to a data source that contains domain names which you want to check for DGA?

Let's assume you have the data indexed in Splunk and the domain name defined with a field extraction (e.g. as in https://answers.splunk.com/answers/132364/dns-debug-log-dns-log-format-review.html ), so that a field like domain is available in your search. Let's further assume you have an existing trained model from the DGA App (check the setup dashboard to confirm that setup has completed). You can then apply this model to your data in the following 3 steps:

  1. Define a search that returns the domain names that you want to check, e.g.
    index=YOUR_DNS_DATA_INDEX domain=* | fields domain | stats count by domain

  2. Then add SPL to run the feature calculation and apply the machine learning model (see the dga_eventgen saved search in the DGA App)
    ...
    | `ut_shannon(domain)`
    | `ut_meaning(domain)`
    | eval ut_digit_ratio = 0.0
    | eval ut_vowel_ratio = 0.0
    | eval ut_domain_length = max(1,len(domain))
    | rex field=domain max_match=0 "(?<digits>\d)"
    | rex field=domain max_match=0 "(?<vowels>[aeiou])"
    | eval ut_digit_ratio=if(isnull(digits),0.0,mvcount(digits) / ut_domain_length)
    | eval ut_vowel_ratio=if(isnull(vowels),0.0,mvcount(vowels) / ut_domain_length)
    | eval ut_consonant_ratio = max(0.0, 1.000000 - ut_digit_ratio - ut_vowel_ratio)
    | eval ut_vc_ratio = ut_vowel_ratio / ut_consonant_ratio
    | apply "dga_ngram"
    | apply "dga_pca"
    | apply "dga_randomforest" as class
    | fields - digits vowels domain_tfidf* PC* ut*
    | where class="dga"

  3. Define an alert or other actions of interest
    Based on the search results above you can now save this as an alert or log the results into a summary index. Depending on your workflow you may want to add additional information like time, asset info etc. which can easily be added to the search above using other indexed fields or enriching with lookups.
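
As a rough sketch of the summary-index option in step 3, the detections could be written out by appending `collect` to the end of the search from step 2. `YOUR_SUMMARY_INDEX` and the `detected_at` field are placeholders for illustration, not something the DGA App ships, and the index has to exist before the search runs.

```spl
    ...
    | where class="dga"
    | eval detected_at = now()
    | collect index=YOUR_SUMMARY_INDEX
```

Saving the same search as a scheduled alert instead would not need the `collect` line.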

Please consider improving your models by retraining on a bigger dataset (e.g. the dga_test dataset contained in the DGA App) and incorporating environment specifics into your training dataset. For better performance, consider deduplicating and filtering the domain names so that you don't waste CPU cycles running the models on domains that are already known or should be excluded.
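
A minimal sketch of the deduplication and filtering idea, assuming you maintain a CSV lookup of domains you already trust. The file name `YOUR_KNOWN_DOMAINS.csv` and its `domain` column are hypothetical; the `known_domain` field only exists to mark matches.

```spl
index=YOUR_DNS_DATA_INDEX domain=*
| eval domain = lower(domain)
| stats count by domain
| lookup YOUR_KNOWN_DOMAINS.csv domain OUTPUT domain AS known_domain
| where isnull(known_domain)
| fields domain count
```

`stats count by domain` leaves one row per distinct domain, and the `where` clause drops every domain found in the lookup, so the feature calculation and `apply` steps only run on the remainder.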

kimikoyan
Explorer

Hi buddy,
I followed your answer and found that it does return a result list with the dga tag. But there seem to be too many false positive DGA domains... My SPL is listed below:

index=nids event_type=dns earliest=-5min
|stats count by query|rename query as domain|fields domain
| eval _time=now()-(60.000*random()/2147483647)
| table _time domain
| `ut_shannon(domain)` 
| `ut_meaning(domain)` 
| eval ut_digit_ratio = 0.0 
| eval ut_vowel_ratio = 0.0 
| eval ut_domain_length = max(1,len(domain))  
| rex field=domain max_match=0 "(?<digits>\d)" 
| rex field=domain max_match=0 "(?<vowels>[aeiou])" 
| eval ut_digit_ratio=if(isnull(digits),0.0,mvcount(digits) / ut_domain_length) 
| eval ut_vowel_ratio=if(isnull(vowels),0.0,mvcount(vowels) / ut_domain_length) 
| eval ut_consonant_ratio = max(0.0, 1.000000 - ut_digit_ratio - ut_vowel_ratio)  
| eval ut_vc_ratio = ut_vowel_ratio / ut_consonant_ratio 
| apply "dga_ngram" 
| apply "dga_pca" 
| apply "dga_randomforest" as class
| fields - digits - vowels - domain_tfidf*
| where class="dga"

I also tried to improve my models by retraining on the dga_test dataset, but the final results are still not satisfactory.
How can I feed these real DNS domains into the DGA App and then use step 4 to check and adjust my detected DGA-classified domain names for further blacklisting/whitelisting and future learning?

pdrieger_splunk
Splunk Employee

Hi @kimikoyan, great to hear you were able to apply it to your data. Just note that you may not want to override _time with random times; you can keep your original timestamps if you want. Please keep in mind that the training datasets shipped with the app are mainly supposed to work with . patterns. If you see subdomains in your data, this distorts the results unless you filter them out beforehand (e.g. using ut_parse from URL Toolbox). I would recommend building your own training dataset by merging the existing dga_test set with your environment-specific legitimate domains and, ideally, adding more DGA domains from the generators that you want to detect. You can rerun all the steps you need from the app with your own dataset to build a better model and reduce your false positive rate.
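
One way the subdomain filtering might look, as a sketch only. It assumes that URL Toolbox provides a two-argument `ut_parse_extended` macro that takes a suffix-list field and returns a `ut_domain` field holding the registered domain without subdomains; please verify the macro name and its output fields against your installed version before relying on it.

```spl
index=YOUR_DNS_DATA_INDEX domain=*
| eval list = "mozilla"
| `ut_parse_extended(domain, list)`
| stats count by ut_domain
| rename ut_domain AS domain
```

The feature calculation and `apply` steps would then run on the reduced `domain` field, so that many different subdomains of one domain collapse into a single row.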

bestSplunker
Contributor

@pdrieger_splunk Is there a detailed reference manual?

kimikoyan
Explorer

Thank you for your reply. What do you mean by ". patterns"? Do you mean I should filter out top-level domains from my DNS traffic and then apply the trained DGA model to the TLD domains?

pdrieger_splunk
Splunk Employee

Hi thambisetty, simply put, you need to calculate the features on your DNS data that your models are based on, and then you can apply your trained models straight after. However, you may also want to deduplicate your domains (after URL parsing) to avoid redundant computations when applying the model. Let me know if this is helpful or if you need more details. Best, Philipp
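
To get a feel for what "calculating the features" produces before any model is applied, a small sanity check on a single made-up domain could look like the sketch below. The domain value is invented for illustration, and only the character-ratio features are covered; the entropy and n-gram features are left out.

```spl
| makeresults
| eval domain = "a1b2c3example.com"
| eval ut_domain_length = max(1, len(domain))
| rex field=domain max_match=0 "(?<digits>\d)"
| rex field=domain max_match=0 "(?<vowels>[aeiou])"
| eval ut_digit_ratio = if(isnull(digits), 0.0, mvcount(digits) / ut_domain_length)
| eval ut_vowel_ratio = if(isnull(vowels), 0.0, mvcount(vowels) / ut_domain_length)
| table domain ut_domain_length ut_digit_ratio ut_vowel_ratio
```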

adam_dixon95
Explorer

Hi,

Could you expand on this?

I'm looking at applying this to our DNS logs in real use cases, though this description isn't helping much.

Thanks

kimikoyan
Explorer

I have the same question... The DGA setup seems to just create a model. But how do I use this model to check my real DNS traffic? We don't know how...

pdrieger_splunk
Splunk Employee

@kimikoyan, @adam_dixon95 see the detailed answer posted below.
