Splunk Tech Talks
Deep-dives for technical practitioners.

ML in Security: Elevate Your DGA Detection Game

WhitneySink
Splunk Employee

Threat research shows that a large percentage of organizations experience DNS attacks. Often, adversaries dynamically generate domain names using Domain Generation Algorithms (DGA) to create C2 infrastructure not prone to static analysis disruption.

The DGA Deep Learning pre-trained model, recently developed by the Splunk Machine Learning for Security team, processes complex domain patterns along with custom features capturing characteristics of a domain. The detection, used with a simple “apply” command, identifies DGA domains with 99.37% accuracy.

Highlights:

  • The complexity of DGA threats
  • The motivation for a Deep Learning based detection
  • Differentiation in performance accuracy
  • Deployment of DGA detection in Splunk
adepp
Splunk Employee

Don't forget, we're hosting TWO Community Office Hour sessions as a follow-up to this Tech Talk! Register here:

This is your opportunity to get live, hands-on help from our Splunk ML and Threat Research experts. Register for the sessions above to ask questions and get help on your specific ML challenge or use case.


Tech Talk Q&A

Q: What dataset did you use to train your model?

A: We collected 15M DNS-related records over time: domains generated by known DGA families, top and frequently visited domains, and CDN-generated domains. The dataset was split into 90% for training, with the remainder held out for the validation and test datasets.

 

Q: The Splunk MLTK app just works on the sample data. How can we configure it to work on our own data?

A: The search that uses the pre-trained model is available in ES Content Updates (ESCU). The detection reads domain-related data from the Network Resolution data model. It takes query, a fully qualified domain name, as input and generates risk events for domains classified as DGA. The pre-trained model can also be used outside the ESCU context, for example in an SPL search that uses an index or lookup as input and pipes the results to | apply pretrained_dga_model_dsdl. You can also use it outside Splunk through the Jupyter notebook.

 

Q: Is this model only available on ES accounts?

A: The model is available for all accounts. It is pre-trained and can be used in an SPL search with | apply pretrained_dga_model_dsdl. The model accepts domain, a fully qualified domain name, as input.

 

Q: I am new to ML. Is it possible to create a correlation rule using ML?

A: Yes, you can use MLTK algorithms such as DensityFunction inside correlation searches.

 

Q: What's the best resource for learning ML for security purposes in Splunk?

A: You can search for DSDL or MLTK at https://research.splunk.com/ to find detections that use them. The team also publishes blog posts with titles beginning "Machine Learning in Security:" that give a detailed walkthrough of each detection, its model architecture, its performance, and so on.

 

Q: Are these models trained on a fixed set of data? Does the model get trained on new data? Can users train these models on their own data?

A: Yes, the model has been trained on a fixed dataset and is available as a pre-trained model. However, the model can be updated with your own data. This requires implementing the fit() method in the Jupyter notebook and performing the same preprocessing for training that the apply() method performs.
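
The point about matching preprocessing can be illustrated with a minimal sketch. It is not the notebook's actual code: the fit() and apply() signatures here are simplified, the character-to-index encoding, MAX_LEN, and the helper names build_vocab and encode are all assumptions for illustration, and the network training step is omitted entirely. What it shows is that apply() must reuse exactly the encoding that fit() established.

```python
PAD, UNK = 0, 1   # reserved indices: padding and characters unseen in training
MAX_LEN = 20      # hypothetical fixed input length, not taken from the model


def build_vocab(domains):
    """Map every character seen in the training domains to an integer index."""
    chars = sorted({c for d in domains for c in d.lower()})
    return {c: i + 2 for i, c in enumerate(chars)}


def encode(domain, vocab, max_len=MAX_LEN):
    """Shared preprocessing: lowercase, truncate, index, and right-pad."""
    ids = [vocab.get(c, UNK) for c in domain.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def fit(domains, labels):
    """Toy stand-in for training: learn the vocabulary and encode the inputs."""
    vocab = build_vocab(domains)
    encoded = [encode(d, vocab) for d in domains]
    # A real fit() would train or fine-tune the network on (encoded, labels) here.
    return {"vocab": vocab, "n_train": len(encoded)}


def apply(model, domains):
    """Encode new domains with the vocabulary learned in fit()."""
    return [encode(d, model["vocab"]) for d in domains]


model = fit(["ab.com"], [0])
print(apply(model, ["abz"])[0])
```

If apply() built its own vocabulary instead of reusing the one stored by fit(), the same character could map to a different index at inference time and the model's inputs would no longer mean what they meant during training.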

 

Q: Do you have any statistics on how well the model performs (FN/FP rates) on real-world data, that is, on new DGA domains that are not part of your training/test data?

A: The field team is working closely with customers to deploy the DGA model. The deep learning model yields 82% accuracy on their benchmark data (not the test/train data). Even so, its performance is the highest among all solutions, including the DGA App for Splunk.

 

Q: Is there a step-by-step training guide available?

A: The step-by-step procedure is outlined in this blog: https://www.splunk.com/en_us/blog/security/machine-learning-in-security-deep-learning-based-dga-dete....

 

Q: Is this part of the risk-based alerting (RBA) module?

A: Yes, the ESCU detection generates risk events when it finds DGA-generated domains. The risk score for each event is 63, which comes from a confidence of 90% and an impact of 70. Based on the risk incident rules in ES, notables are generated for risk events. The ESCU search could also be modified to generate notables instead of risk events.
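
The arithmetic behind the score of 63 can be checked in a few lines. This sketch assumes the score is the impact scaled by the confidence percentage, which matches the numbers quoted in the answer; the function name risk_score is illustrative, not a Splunk API.

```python
def risk_score(confidence: float, impact: float) -> float:
    """Scale the impact (0-100) by the confidence expressed as a percentage."""
    return confidence * impact / 100


score = risk_score(confidence=90, impact=70)
print(score)  # 63.0
```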

 

Q: A follow-up question on creating a correlation rule: we have found DGA-generated domains using MLA. How can I use this ML to trigger a correlation alert when a DGA is found in our environment, or how can I bring this data into an investigation in ES?

A: The correlation search should be able to generate risk events / notables. Modify the Risk Analysis section for the correlation search.

 

Q: Can I use the pre-trained models outside of Splunk? Can I run the models locally in my Python environment?

A: Yes, absolutely. The models are pre-trained and available for download. However, some preprocessing logic is required to compute the custom features needed for inference on live data. Please take a look at the apply() method in the Jupyter notebook and make sure the new features are also passed to the model.
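
To give a sense of what computing custom features for a domain can look like, here is a minimal sketch. The three features (length, character entropy, digit ratio) are common choices in DGA detection in general and are assumptions here; the answer does not say which features the model uses, so the apply() method in the notebook remains the reference. The function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits, of the character distribution of text."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def domain_features(domain: str) -> dict:
    """Compute illustrative per-domain features (hypothetical feature set)."""
    name = domain.strip().lower().rstrip(".")
    chars = name.replace(".", "")  # ignore label separators
    digits = sum(c.isdigit() for c in chars)
    return {
        "length": len(chars),
        "entropy": shannon_entropy(chars),
        "digit_ratio": digits / len(chars) if chars else 0.0,
    }


for d in ["splunk.com", "x7f3kq9zt2lw.net"]:
    print(d, domain_features(d))
```

Whatever the real feature set is, the same functions have to run on live data that ran on the training data, so it is worth keeping them in one module that both paths import.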

 
