<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Splunk Machine Learning Toolkit: How do I determine which parameters are the most important? in All Apps and Add-ons</title>
    <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-How-do-I-determine-which/m-p/303340#M36140</link>
    <description>&lt;P&gt;Hey @AdrienS,&lt;/P&gt;

&lt;P&gt;Cool question - and yes, you're right that you can use the &lt;CODE&gt;summary&lt;/CODE&gt; command to inspect &lt;CODE&gt;feature_importances&lt;/CODE&gt; for some of the models (e.g. RandomForestClassifier). Other models, however, may not support the same type of summary.&lt;/P&gt;
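
&lt;P&gt;As a sketch (the model name &lt;CODE&gt;rf_model&lt;/CODE&gt; and the field names here are just placeholders), you can save the fitted model with &lt;CODE&gt;into&lt;/CODE&gt;:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fit RandomForestClassifier target from feature* into rf_model
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;and then inspect its feature importances in a separate search with &lt;CODE&gt;| summary rf_model&lt;/CODE&gt;.&lt;/P&gt;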

&lt;P&gt;You should also check out the &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#FieldSelector"&gt;FieldSelector algorithm&lt;/A&gt;, which is really useful for this problem. Under the hood, it uses ANOVA &amp;amp; F-tests to estimate the linear dependency between variables. Although it's univariate (it doesn't capture any interactions between variables), it can still provide a good baseline for choosing a handful of features from hundreds.&lt;/P&gt;

&lt;P&gt;An example of its use for regression might look like:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fields target feature1 feature2 feature3 feature4 .... feature1000
| fit FieldSelector type=numeric mode=k_best param=3 target from feature*
| fields fs*
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;And for classification:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fields target feature1 feature2 feature3 feature4 .... feature1000
| fit FieldSelector type=categorical mode=k_best param=3 target from feature*
| fields fs*
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;You can try some of the other modes &amp;amp; params to get different selections. The key thing to check is the fields prefixed with &lt;CODE&gt;fs_&lt;/CODE&gt;, as those are the fields that were "selected" by the algorithm.&lt;/P&gt;
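
&lt;P&gt;For example (reusing the field names from above), &lt;CODE&gt;mode=percentile&lt;/CODE&gt; should keep the top percentage of features rather than a fixed count:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| fit FieldSelector type=categorical mode=percentile param=10 target from feature*
&lt;/CODE&gt;&lt;/PRE&gt;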

&lt;P&gt;If you do want to capture interactions, you can manually fabricate interaction features like:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| eval feature2xFeature3 = feature2 * feature3
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Another option might be to look at using the &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#RandomForestRegressor"&gt;RandomForest algorithms (there is a classifier &amp;amp; regressor)&lt;/A&gt; with a large number for &lt;CODE&gt;n_estimators&lt;/CODE&gt; (the number of trees in the forest). If you look at the feature importances in the summary, you might be able to select a subset of fields you'd like to use.&lt;/P&gt;
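
&lt;P&gt;A sketch of that approach (the model and field names are placeholders):&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fit RandomForestClassifier target from feature* n_estimators=100 into rf_selector
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;then run &lt;CODE&gt;| summary rf_selector&lt;/CODE&gt; and sort by importance to pick your subset.&lt;/P&gt;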

&lt;P&gt;Another option is to use L1 regularization with algorithms like &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#Lasso"&gt;Lasso&lt;/A&gt; or &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#SGDClassifier"&gt;Stochastic Gradient Descent&lt;/A&gt; by modifying the &lt;CODE&gt;alpha&lt;/CODE&gt; (both) and &lt;CODE&gt;penalty&lt;/CODE&gt; (SGD only) parameters. Afterwards, you can use &lt;CODE&gt;summary&lt;/CODE&gt; to look for coefficients that have been squashed to zero. See &lt;A href="http://scikit-learn.org/stable/modules/feature_selection.html#l1-based-feature-selection"&gt;here&lt;/A&gt; for more info.&lt;/P&gt;</description>
    <pubDate>Wed, 30 Aug 2017 16:55:37 GMT</pubDate>
    <dc:creator>aljohnson_splun</dc:creator>
    <dc:date>2017-08-30T16:55:37Z</dc:date>
    <item>
      <title>Splunk Machine Learning Toolkit: How do I determine which parameters are the most important?</title>
      <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-How-do-I-determine-which/m-p/303338#M36138</link>
      <description>&lt;P&gt;Hello, &lt;/P&gt;

&lt;P&gt;Maybe it is an easy one and I just did not see it. Basically, I am running the machine learning app to predict a categorical field (OK/NOK). &lt;BR /&gt;
It worked smoothly and I got some nice predictions. So far so good. &lt;BR /&gt;
But now, of the hundreds of parameters that I added to predict this categorical field, how do I know which ones are the most important features? In Python with scikit-learn, I would do something like this: &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;import numpy as np
import matplotlib.pyplot as plt

# Sort features by importance (ascending) and plot a horizontal bar chart
importances = classifier.feature_importances_
indices = np.argsort(importances)
features = dataset.columns[0:26]
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;However, I would prefer using the Splunk interface (my Python skills are pretty limited). So my question: did I miss this option in the app? If not, can I use the results of Splunk in the Python script (e.g. how do I get &lt;CODE&gt;feature_importances_&lt;/CODE&gt; as an argument for the script)?&lt;/P&gt;

&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Tue, 29 Sep 2020 15:31:05 GMT</pubDate>
      <guid>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-How-do-I-determine-which/m-p/303338#M36138</guid>
      <dc:creator>AdrienS</dc:creator>
      <dc:date>2020-09-29T15:31:05Z</dc:date>
    </item>
    <item>
      <title>Re: Splunk Machine Learning Toolkit: How do I determine which parameters are the most important?</title>
      <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-How-do-I-determine-which/m-p/303339#M36139</link>
      <description>&lt;P&gt;I gess &lt;CODE&gt;|summary&lt;/CODE&gt; is the answer. or I missed something?&lt;/P&gt;</description>
      <pubDate>Tue, 29 Aug 2017 06:18:54 GMT</pubDate>
      <guid>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-How-do-I-determine-which/m-p/303339#M36139</guid>
      <dc:creator>AdrienS</dc:creator>
      <dc:date>2017-08-29T06:18:54Z</dc:date>
    </item>
    <item>
      <title>Re: Splunk Machine Learning Toolkit: How do I determine which parameters are the most important?</title>
      <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-How-do-I-determine-which/m-p/303340#M36140</link>
      <description>&lt;P&gt;Hey @AdrienS,&lt;/P&gt;

&lt;P&gt;Cool question - and yes, you're right that you can use the &lt;CODE&gt;summary&lt;/CODE&gt; command to inspect &lt;CODE&gt;feature_importances&lt;/CODE&gt; for some of the models (e.g. RandomForestClassifier). Other models, however, may not support the same type of summary.&lt;/P&gt;
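
&lt;P&gt;As a sketch (the model name &lt;CODE&gt;rf_model&lt;/CODE&gt; and the field names here are just placeholders), you can save the fitted model with &lt;CODE&gt;into&lt;/CODE&gt;:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fit RandomForestClassifier target from feature* into rf_model
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;and then inspect its feature importances in a separate search with &lt;CODE&gt;| summary rf_model&lt;/CODE&gt;.&lt;/P&gt;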

&lt;P&gt;You should also check out the &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#FieldSelector"&gt;FieldSelector algorithm&lt;/A&gt;, which is really useful for this problem. Under the hood, it uses ANOVA &amp;amp; F-tests to estimate the linear dependency between variables. Although it's univariate (it doesn't capture any interactions between variables), it can still provide a good baseline for choosing a handful of features from hundreds.&lt;/P&gt;

&lt;P&gt;An example of its use for regression might look like:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fields target feature1 feature2 feature3 feature4 .... feature1000
| fit FieldSelector type=numeric mode=k_best param=3 target from feature*
| fields fs*
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;And for classification:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fields target feature1 feature2 feature3 feature4 .... feature1000
| fit FieldSelector type=categorical mode=k_best param=3 target from feature*
| fields fs*
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;You can try some of the other modes &amp;amp; params to get different selections. The key thing to check is the fields prefixed with &lt;CODE&gt;fs_&lt;/CODE&gt;, as those are the fields that were "selected" by the algorithm.&lt;/P&gt;
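
&lt;P&gt;For example (reusing the field names from above), &lt;CODE&gt;mode=percentile&lt;/CODE&gt; should keep the top percentage of features rather than a fixed count:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| fit FieldSelector type=categorical mode=percentile param=10 target from feature*
&lt;/CODE&gt;&lt;/PRE&gt;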

&lt;P&gt;If you do want to capture interactions, you can manually fabricate interaction features like:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| eval feature2xFeature3 = feature2 * feature3
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Another option might be to look at using the &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#RandomForestRegressor"&gt;RandomForest algorithms (there is a classifier &amp;amp; regressor)&lt;/A&gt; with a large number for &lt;CODE&gt;n_estimators&lt;/CODE&gt; (the number of trees in the forest). If you look at the feature importances in the summary, you might be able to select a subset of fields you'd like to use.&lt;/P&gt;
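
&lt;P&gt;A sketch of that approach (the model and field names are placeholders):&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fit RandomForestClassifier target from feature* n_estimators=100 into rf_selector
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;then run &lt;CODE&gt;| summary rf_selector&lt;/CODE&gt; and sort by importance to pick your subset.&lt;/P&gt;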

&lt;P&gt;Another option is to use L1 regularization with algorithms like &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#Lasso"&gt;Lasso&lt;/A&gt; or &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#SGDClassifier"&gt;Stochastic Gradient Descent&lt;/A&gt; by modifying the &lt;CODE&gt;alpha&lt;/CODE&gt; (both) and &lt;CODE&gt;penalty&lt;/CODE&gt; (SGD only) parameters. Afterwards, you can use &lt;CODE&gt;summary&lt;/CODE&gt; to look for coefficients that have been squashed to zero. See &lt;A href="http://scikit-learn.org/stable/modules/feature_selection.html#l1-based-feature-selection"&gt;here&lt;/A&gt; for more info.&lt;/P&gt;</description>
      <pubDate>Wed, 30 Aug 2017 16:55:37 GMT</pubDate>
      <guid>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-How-do-I-determine-which/m-p/303340#M36140</guid>
      <dc:creator>aljohnson_splun</dc:creator>
      <dc:date>2017-08-30T16:55:37Z</dc:date>
    </item>
  </channel>
</rss>

