
In Splunk Machine Learning Toolkit, how to use this baseline to predict if the incoming email is phishing or not?

KrithikaRamakri
Explorer

Hi everyone, I am new to Splunk. I am trying to use the Splunk Machine Learning Toolkit for phishing email detection. I have created a baseline of roughly 3,000 phishing emails and 1,000 legitimate emails, and I would like to use it to predict whether an incoming email is phishing or not. But when I apply "Predict Categorical Fields", it drops all the fields with this error: "Dropping fields with too many distinct values". How can this be achieved with Splunk? Can someone please help me with this?


aljohnson_splun
Splunk Employee

Hi @KrithikaRamakrishnan,

Most likely, your email data lives in some text field, say email_text. Before machine learning algorithms can analyze text like this, the input has to be converted into some type of numeric representation.

The Machine Learning Toolkit will try to help you out here by automatically converting categorical variables, like email_text, into numeric fields. The most common and easiest way to do this is called "dummy encoding", sometimes also called "one-hot" encoding. The MLTK does this using pandas' get_dummies.

Let's look at a simpler example. Say we have a field called color, and it has three values:

color
red
green
blue

If we left it at this, as a categorical field, behind the scenes the MLTK would convert this categorical field into a one-hot representation like this:

color_red  color_green  color_blue
1          0            0
0          1            0
0          0            1

For a simple field with a moderate number of discrete values, this strategy actually works pretty well despite its simplicity.

Let's go back to the email example. When this type of encoding happens on a field where each value of email_text is unique, you get one new column per email: a huge sparse matrix that is almost entirely zeros, with a single 1 in each row. In this case, each event isn't really being represented by anything useful for machine learning, so by default the MLTK drops the field, which is the "Dropping fields with too many distinct values" message you are seeing.
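To make the difference concrete, here is a minimal Python sketch of dummy encoding. It is not the MLTK's own code; the one_hot helper and the sample emails are made up for illustration, and it uses only the standard library rather than pandas.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) for a dummy/one-hot encoding of values."""
    # One output column per distinct value (hypothetical helper, not MLTK code).
    categories = sorted(set(values))
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


# A field with a few repeated values: the encoding stays small.
colors = ["red", "green", "blue", "red"]
color_cols, color_rows = one_hot(colors, "color")

# A text field where every value is unique: one column per event.
emails = [
    "verify your account now",
    "meeting moved to 3pm",
    "your invoice is attached",
    "click here to claim your prize",
]
email_cols, email_rows = one_hot(emails, "email_text")

print(len(color_cols), "columns for", len(colors), "color events")
print(len(email_cols), "columns for", len(emails), "email events")
```

With the color field, four events share three columns. With the email field, the number of columns equals the number of events, so no two emails ever share a feature and a model has nothing to generalize from.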

When you need to create features out of a text field like email_text, the preprocessing you need is most likely vectorization of the text. You can read more about vectorization here. Thankfully, the MLTK already ships with a vectorizer algorithm: term frequency-inverse document frequency (TFIDF) vectorization.

So,

... base search ...
| fit TFIDF email_text

will create 100 fields that are based on the TFIDF weighting of word occurrences. The parameters on the linked docs page let you build features from characters instead of words, or change how many features are produced (e.g. analyzer=char, ngram_range=1-3, or max_features=200).

You'll see that the TFIDF algorithm will then turn your text into a useful representation for machine learning, rather than the useless representation you get (for this kind of data) with the one-hot/dummy encoding.
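For intuition about what those generated fields contain, here is a simplified, standard-library Python sketch of TFIDF weighting. Treat it as an illustration under assumptions, not as the MLTK's implementation: the tfidf helper, the whitespace tokenizer, the smoothed IDF formula, the row normalization, and the sample emails are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def tfidf(docs, max_features=100):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights (illustrative only)."""
    # Naive tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]

    # Keep only the most frequent terms; ties are broken alphabetically.
    totals = Counter(tok for toks in tokenized for tok in toks)
    vocab = sorted(sorted(totals), key=lambda t: -totals[t])[:max_features]
    vocab.sort()

    # Smoothed inverse document frequency: rarer terms get a larger weight.
    n_docs = len(docs)
    doc_freq = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        # Scale each row to unit length; leave all-zero rows untouched.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


emails = [
    "verify your account now",
    "your account statement is attached",
    "team lunch moved to friday",
]
vocab, rows = tfidf(emails, max_features=5)

print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

The number of columns is capped by max_features instead of growing with every distinct email, and emails that share words now share non-zero columns, which is what gives a classifier something to learn from.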

Hope this helps 🙂 If you need more help, try asking in the #machinelearning channel on the Splunk user groups Slack; read more about that here.


P.S. If, however, you do want to change the limit that is imposed, the maximum number of distinct values can be modified in mlspl.conf; see https://docs.splunk.com/Documentation/MLApp/3.3.0/User/Configurefitandapply
