All Apps and Add-ons

Splunk Machine Learning Toolkit: How do I determine which parameters are the most important?

AdrienS
Explorer

Hello,

Maybe it is an easy one and I just did not see it. Basically, I am running the machine learning app to predict a categorical field (OK/NOK).
It worked smoothly and I got some nice predictions. So far so good.
But now, of the hundreds of parameters that I added to predict this categorical field, how do I know which ones are the most important features? In Python with scikit-learn, I would do something like this:

import numpy as np
import matplotlib.pyplot as plt

# classifier is a fitted tree-based model, dataset is a pandas DataFrame
importances = classifier.feature_importances_
indices = np.argsort(importances)
features = dataset.columns[:len(importances)]
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()

However, I would prefer using the Splunk interface (my Python skills are pretty limited). So my question: did I miss this option in the app? If not, can I use the results of Splunk in the Python script (e.g., how do I get feature_importances_ as arguments for the script)?

Thanks

0 Karma
1 Solution

aljohnson_splun
Splunk Employee

Hey @AdrienS,

Cool question - and yes, you're right that you can use the summary command to inspect feature_importances_ for some of the models (e.g. RandomForestClassifier). Not all models support the same type of summary, however.
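For reference, here is a minimal scikit-learn sketch of what that summary exposes for a random forest: the fitted model's feature_importances_ array, one score per input column. The data and field names below are synthetic, purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Make the label depend mostly on the first column.
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance score per input column; the scores sum to 1.
for name, score in zip(["f0", "f1", "f2", "f3"], clf.feature_importances_):
    print(f"{name}: {score:.3f}")
```

Since the label here is driven by the first column, f0 ends up with the largest importance score.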

You should also check out the FieldSelector algorithm, which is really useful for this problem. Under the hood, it uses ANOVA F-tests to estimate the linear dependency between variables. Although it's univariate (it does not capture any interactions between variables), it can still provide a good baseline for choosing a handful of features from hundreds.
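A rough scikit-learn analogue of that univariate selection is SelectKBest with the f_classif (ANOVA F-test) scoring function; the toy data and the choice of k=3 below are assumptions for illustration, not MLTK internals.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
# Class label driven by columns 0 and 3 only; the rest are noise.
y = ((X[:, 0] + X[:, 3]) > 0).astype(int)

# Score each column independently with an ANOVA F-test, keep the top 3.
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
selected = selector.get_support(indices=True)  # column indices kept
print("selected columns:", selected)
```

Because the F-test scores each column on its own, the informative columns 0 and 3 get picked up, which mirrors how FieldSelector's fs_-prefixed output fields work.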

An example of its use for regression might look like:

index=foo
| fields target feature1 feature2 feature3 feature4 .... feature1000
| fit FieldSelector type=numeric mode=k_best param=3 target from feature*
| fields fs*

And for classification:

index=foo
| fields target feature1 feature2 feature3 feature4 .... feature1000
| fit FieldSelector type=categorical mode=k_best param=3 target from feature*
| fields fs*

You can try some of the other modes & params to get different selections. The key thing to check is the fields prefixed with fs_, as those are the fields that were "selected" by the algorithm.

If you do want to capture interactions, you can manually fabricate interaction features like:

| eval feature2xFeature3 = feature2 * feature3
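If you have many features, writing one eval per pair gets tedious; a hedged scikit-learn sketch of generating all pairwise products programmatically uses PolynomialFeatures (the column names here are made up):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# degree=2 with interaction_only=True adds each pairwise product
# (f1*f2, f1*f3, f2*f3) after the 3 original columns.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
Xi = poly.fit_transform(X)
print(Xi.shape)  # 3 original columns + 3 interaction columns
```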

Another option might be to look at using the RandomForest algorithms (there is a classifier & a regressor) with a large number for n_estimators (the number of trees in the forest). If you look at the feature importances in the summary, you might be able to select a subset of fields you'd like to use.

Another option is to look at using L1-regularization with algorithms like Lasso or Stochastic Gradient Descent by modifying the alpha (both) and penalty (SGD) parameters. After that, you can use summary to look at the coefficients that have been squashed to zero. See here for more info.
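To illustrate that L1 squashing effect outside Splunk, here is a small scikit-learn sketch with Lasso on synthetic data (the alpha value and feature setup are assumptions for the demo): with a large enough alpha, the coefficients of uninformative columns go exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
# Target depends on columns 0 and 1 only; the other 6 are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# L1 penalty: a moderately large alpha drives weak coefficients to 0.
model = Lasso(alpha=0.5).fit(X, y)
kept = [i for i, c in enumerate(model.coef_) if c != 0]
print("non-zero coefficients at columns:", kept)
```

Only the two informative columns survive the penalty, which is the same pruning you would read off the MLTK summary output.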


AdrienS
Explorer

I guess | summary is the answer. Or did I miss something?

0 Karma