Splunk Machine Learning Toolkit: How do I determine which parameters are the most important?

AdrienS
Explorer

Hello,

Maybe this is an easy one and I just did not see it. Basically, I am running the machine learning app to predict a categorical field (OK/NOK).
It worked smoothly and I got some nice predictions. So far so good.
But now, out of the hundreds of parameters that I added to predict this categorical field, how do I know which ones are the most important features? In Python with scikit-learn, I would do something like this:

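import numpy as np
import matplotlib.pyplot as plt

# classifier: an already fitted tree-based model; dataset: assumed to be a pandas DataFrame
importances = classifier.feature_importances_
indices = np.argsort(importances)  # ascending order, so the most important bar is drawn on top
features = dataset.columns[0:26]
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()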

However, I would prefer using the Splunk interface (my Python skills are pretty limited). So my question: did I miss this option in the app? If not, can I use the results from Splunk in the Python script (e.g. how do I get feature_importances_ as an input for the script)?

Thanks

aljohnson_splun
Splunk Employee

Hey @AdrienS,

Cool question - and yes, you're right that you can use the summary command to inspect feature_importances for some of the models (e.g. RandomForestClassifier). Other models may not support the same type of summary, however.

You should also check out the FieldSelector algorithm, which is really useful for this problem. Under the hood, it uses ANOVA & F-tests to estimate the linear dependency between variables. Although it's univariate (not capturing any interactions between variables), it can still provide a good baseline for choosing a handful of features from hundreds.

An example of its use for regression might look like:

index=foo
| fields target feature1 feature2 feature3 feature4 .... feature1000
| fit FieldSelector type=numeric mode=k_best param=3 target from feature*
| fields fs*

And for classification:

index=foo
| fields target feature1 feature2 feature3 feature4 .... feature1000
| fit FieldSelector type=categorical mode=k_best param=3 target from feature*
| fields fs*

You can try some of the other modes & params to get different selections. The key thing to check out is the set of fields prefixed with fs_, as those are the fields that were "selected" by the algorithm.
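
To make the "univariate ANOVA / F-test" idea concrete, here is a minimal standalone Python sketch of k_best-style selection for a categorical target. It is not MLTK's implementation; the helper names (anova_f_score, k_best) and the toy data are made up for illustration. Each feature is scored on its own by how well it separates the class means, and the top k are kept.

```python
from statistics import mean

def anova_f_score(values, labels):
    """One-way ANOVA F statistic of one numeric feature against class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more rows than classes")
    grand_mean = mean(values)
    # Variation of the class means around the overall mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the rows around their own class mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def k_best(features, labels, k):
    """Score every feature independently and return the k highest-scoring names."""
    scores = {name: anova_f_score(col, labels) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Hypothetical toy data: six rows, three candidate features
labels = ["OK", "OK", "OK", "NOK", "NOK", "NOK"]
features = {
    "feature1": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # clearly separates OK from NOK
    "feature2": [5.0, 4.0, 6.0, 5.0, 6.0, 4.0],  # same mean in both classes
    "feature3": [0.1, 0.4, 0.2, 0.5, 0.3, 0.6],  # weak separation
}
selected, scores = k_best(features, labels, k=2)
print(selected)
```

Because each score looks at one feature at a time, two fields that only matter in combination would both score low here, which is the univariate limitation mentioned above.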

If you do want to capture interactions, you can manually fabricate interaction features like:

| eval feature2xFeature3 = feature2 * feature3
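
As a rough illustration of how quickly hand-made interactions multiply, here is a small Python sketch that builds every pairwise product for one row. The function name add_interactions and the sample row are hypothetical; the naming pattern just mimics the eval above.

```python
from itertools import combinations

def add_interactions(row, fields):
    """Return a copy of row with one product column per pair of fields."""
    out = dict(row)
    for a, b in combinations(fields, 2):
        out[f"{a}x{b}"] = row[a] * row[b]
    return out

# Hypothetical single event with three numeric fields
row = {"feature1": 2.0, "feature2": 3.0, "feature3": 4.0}
expanded = add_interactions(row, ["feature1", "feature2", "feature3"])
print(expanded)
```

The number of pairs grows as n(n-1)/2, so 300 fields would already give 44,850 products. It probably makes sense to build interactions only for a shortlist of fields.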

Another option might be to look at using the RandomForest algorithms (there is a classifier & regressor) with a large number for n_estimators (the trees in the forest). If you look at the feature importances in the summary, you might be able to select a subset of fields you'd like to use.

Another option is to look at using L1-regularization with algorithms like Lasso or Stochastic Gradient Descent by modifying the alpha (both) and penalty (SGD) parameters. After that, you can look at coefficients that have been squashed to zero using summary. See here for more info.
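
To show what "squashed to zero" means, here is a minimal pure-Python sketch of Lasso fitted by coordinate descent on a numeric target. It is an illustration of the L1 idea under simplifying assumptions (no intercept, centered toy data, made-up function names), not what MLTK or scikit-learn run internally. The objective is (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; anything within [-alpha, alpha] becomes 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=50):
    """Fit Lasso weights (no intercept) by cycling over one coefficient at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the residual that ignores feature j
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
    return w

# Hypothetical toy data: y is roughly 3 * x1, and x2 is only weakly related through noise
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y = [-5.8, -3.1, 0.0, 2.9, 6.0]

w_plain = lasso_coordinate_descent(X, y, alpha=0.0)  # no penalty: both weights non-zero
w_lasso = lasso_coordinate_descent(X, y, alpha=0.5)  # L1 penalty: the weak weight drops to 0
print(w_plain, w_lasso)
```

With the penalty switched on, the weakly related feature ends up with a coefficient of exactly zero, and the strong one is shrunk but survives. Those zeroed coefficients are what you would look for in the summary output.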

AdrienS
Explorer

I guess | summary is the answer. Or did I miss something?
