<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Splunk Machine Learning Toolkit: How do I determine which parameters are the most important? in All Apps and Add-ons</title>
    <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-How-do-I-determine-which/m-p/303340#M36140</link>
    <description>&lt;P&gt;Hey @AdrienS,&lt;/P&gt;

&lt;P&gt;Cool question - and yes, you're right that you can use the &lt;CODE&gt;summary&lt;/CODE&gt; command to inspect &lt;CODE&gt;feature_importances&lt;/CODE&gt; for some of the models (e.g. RandomForestClassifier). Other models, however, may not support the same type of summary.&lt;/P&gt;
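
&lt;P&gt;As a sketch (the model name &lt;CODE&gt;rf_model&lt;/CODE&gt; and the field names here are just placeholders), you can save the fitted model with &lt;CODE&gt;into&lt;/CODE&gt;:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fit RandomForestClassifier target from feature* into rf_model
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;and then inspect its feature importances in a separate search with &lt;CODE&gt;| summary rf_model&lt;/CODE&gt;.&lt;/P&gt;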

&lt;P&gt;You should also check out the &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#FieldSelector"&gt;FieldSelector algorithm&lt;/A&gt;, which is really useful for this problem. Under the hood, it uses ANOVA &amp;amp; F-tests to estimate the linear dependency between variables. Although it's univariate (it doesn't capture any interactions between variables), it can still provide a good baseline for choosing a handful of features from hundreds.&lt;/P&gt;

&lt;P&gt;An example of its use for regression might look like:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fields target feature1 feature2 feature3 feature4 .... feature1000
| fit FieldSelector type=numeric mode=k_best param=3 target from feature*
| fields fs*
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;And for classification:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fields target feature1 feature2 feature3 feature4 .... feature1000
| fit FieldSelector type=categorical mode=k_best param=3 target from feature*
| fields fs*
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;You can try some of the other modes &amp;amp; params to get different selections. The key thing to check is the fields prefixed with &lt;CODE&gt;fs_&lt;/CODE&gt;, as those are the fields that were "selected" by the algorithm.&lt;/P&gt;
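
&lt;P&gt;For example (reusing the field names from above), &lt;CODE&gt;mode=percentile&lt;/CODE&gt; should keep the top percentage of features rather than a fixed count:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| fit FieldSelector type=categorical mode=percentile param=10 target from feature*
&lt;/CODE&gt;&lt;/PRE&gt;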

&lt;P&gt;If you do want to capture interactions, you can manually fabricate interaction features like:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| eval feature2xFeature3 = feature2 * feature3
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Another option might be to look at using the &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#RandomForestRegressor"&gt;RandomForest algorithms (there is a classifier &amp;amp; regressor)&lt;/A&gt; with a large number for &lt;CODE&gt;n_estimators&lt;/CODE&gt; (the number of trees in the forest). If you look at the feature importances in the summary, you might be able to select a subset of fields you'd like to use.&lt;/P&gt;
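
&lt;P&gt;A sketch of that approach (the model and field names are placeholders):&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fit RandomForestClassifier target from feature* n_estimators=100 into rf_selector
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;then run &lt;CODE&gt;| summary rf_selector&lt;/CODE&gt; and sort by importance to pick your subset.&lt;/P&gt;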

&lt;P&gt;Another option is to use L1 regularization with algorithms like &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#Lasso"&gt;Lasso&lt;/A&gt; or &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#SGDClassifier"&gt;Stochastic Gradient Descent&lt;/A&gt; by modifying the &lt;CODE&gt;alpha&lt;/CODE&gt; (both) and &lt;CODE&gt;penalty&lt;/CODE&gt; (SGD only) parameters. Afterwards, you can use &lt;CODE&gt;summary&lt;/CODE&gt; to look for coefficients that have been squashed to zero. See &lt;A href="http://scikit-learn.org/stable/modules/feature_selection.html#l1-based-feature-selection"&gt;here&lt;/A&gt; for more info.&lt;/P&gt;</description>
    <pubDate>Wed, 30 Aug 2017 16:55:37 GMT</pubDate>
    <dc:creator>aljohnson_splun</dc:creator>
    <dc:date>2017-08-30T16:55:37Z</dc:date>
    <item>
      <title>Splunk Machine Learning Toolkit: How do I determine which parameters are the most important?</title>
      <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-How-do-I-determine-which/m-p/303338#M36138</link>
      <description>&lt;P&gt;Hello, &lt;/P&gt;

&lt;P&gt;Maybe it is an easy one and I just did not see it. Basically, I am running the machine learning app to predict a categorical field (OK/NOK). &lt;BR /&gt;
It worked smoothly and I got some nice predictions. So far so good. &lt;BR /&gt;
But now, of the hundreds of parameters that I added to predict this categorical field, how do I know which ones are the most important features? In Python with scikit-learn, I would do something like this: &lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;import numpy as np
import matplotlib.pyplot as plt

# Sort features by importance (ascending) and plot a horizontal bar chart
importances = classifier.feature_importances_
indices = np.argsort(importances)
features = dataset.columns[0:26]
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;However, I would prefer using the Splunk interface (my Python skills are pretty limited). So my question: did I miss this option in the app? If not, can I use the results of Splunk in the Python script (e.g. how do I get &lt;CODE&gt;feature_importances_&lt;/CODE&gt; as an argument for the script)?&lt;/P&gt;

&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Tue, 29 Sep 2020 15:31:05 GMT</pubDate>
      <guid>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-How-do-I-determine-which/m-p/303338#M36138</guid>
      <dc:creator>AdrienS</dc:creator>
      <dc:date>2020-09-29T15:31:05Z</dc:date>
    </item>
    <item>
      <title>Re: Splunk Machine Learning Toolkit: How do I determine which parameters are the most important?</title>
      <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-How-do-I-determine-which/m-p/303339#M36139</link>
      <description>&lt;P&gt;I gess &lt;CODE&gt;|summary&lt;/CODE&gt; is the answer. or I missed something?&lt;/P&gt;</description>
      <pubDate>Tue, 29 Aug 2017 06:18:54 GMT</pubDate>
      <guid>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-How-do-I-determine-which/m-p/303339#M36139</guid>
      <dc:creator>AdrienS</dc:creator>
      <dc:date>2017-08-29T06:18:54Z</dc:date>
    </item>
    <item>
      <title>Re: Splunk Machine Learning Toolkit: How do I determine which parameters are the most important?</title>
      <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-How-do-I-determine-which/m-p/303340#M36140</link>
      <description>&lt;P&gt;Hey @AdrienS,&lt;/P&gt;

&lt;P&gt;Cool question - and yes, you're right that you can use the &lt;CODE&gt;summary&lt;/CODE&gt; command to inspect &lt;CODE&gt;feature_importances&lt;/CODE&gt; for some of the models (e.g. RandomForestClassifier). Other models, however, may not support the same type of summary.&lt;/P&gt;
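
&lt;P&gt;As a sketch (the model name &lt;CODE&gt;rf_model&lt;/CODE&gt; and the field names here are just placeholders), you can save the fitted model with &lt;CODE&gt;into&lt;/CODE&gt;:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fit RandomForestClassifier target from feature* into rf_model
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;and then inspect its feature importances in a separate search with &lt;CODE&gt;| summary rf_model&lt;/CODE&gt;.&lt;/P&gt;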

&lt;P&gt;You should also check out the &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#FieldSelector"&gt;FieldSelector algorithm&lt;/A&gt;, which is really useful for this problem. Under the hood, it uses ANOVA &amp;amp; F-tests to estimate the linear dependency between variables. Although it's univariate (it doesn't capture any interactions between variables), it can still provide a good baseline for choosing a handful of features from hundreds.&lt;/P&gt;

&lt;P&gt;An example of its use for regression might look like:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fields target feature1 feature2 feature3 feature4 .... feature1000
| fit FieldSelector type=numeric mode=k_best param=3 target from feature*
| fields fs*
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;And for classification:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fields target feature1 feature2 feature3 feature4 .... feature1000
| fit FieldSelector type=categorical mode=k_best param=3 target from feature*
| fields fs*
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;You can try some of the other modes &amp;amp; params to get different selections. The key thing to check is the fields prefixed with &lt;CODE&gt;fs_&lt;/CODE&gt;, as those are the fields that were "selected" by the algorithm.&lt;/P&gt;
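
&lt;P&gt;For example (reusing the field names from above), &lt;CODE&gt;mode=percentile&lt;/CODE&gt; should keep the top percentage of features rather than a fixed count:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| fit FieldSelector type=categorical mode=percentile param=10 target from feature*
&lt;/CODE&gt;&lt;/PRE&gt;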

&lt;P&gt;If you do want to capture interactions, you can manually fabricate interaction features like:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| eval feature2xFeature3 = feature2 * feature3
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Another option might be to look at using the &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#RandomForestRegressor"&gt;RandomForest algorithms (there is a classifier &amp;amp; regressor)&lt;/A&gt; with a large number for &lt;CODE&gt;n_estimators&lt;/CODE&gt; (the number of trees in the forest). If you look at the feature importances in the summary, you might be able to select a subset of fields you'd like to use.&lt;/P&gt;
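
&lt;P&gt;A sketch of that approach (the model and field names are placeholders):&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;index=foo
| fit RandomForestClassifier target from feature* n_estimators=100 into rf_selector
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;then run &lt;CODE&gt;| summary rf_selector&lt;/CODE&gt; and sort by importance to pick your subset.&lt;/P&gt;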

&lt;P&gt;Another option is to use L1 regularization with algorithms like &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#Lasso"&gt;Lasso&lt;/A&gt; or &lt;A href="http://docs.splunk.com/Documentation/MLApp/latest/User/Algorithms#SGDClassifier"&gt;Stochastic Gradient Descent&lt;/A&gt; by modifying the &lt;CODE&gt;alpha&lt;/CODE&gt; (both) and &lt;CODE&gt;penalty&lt;/CODE&gt; (SGD only) parameters. Afterwards, you can use &lt;CODE&gt;summary&lt;/CODE&gt; to look for coefficients that have been squashed to zero. See &lt;A href="http://scikit-learn.org/stable/modules/feature_selection.html#l1-based-feature-selection"&gt;here&lt;/A&gt; for more info.&lt;/P&gt;</description>
      <pubDate>Wed, 30 Aug 2017 16:55:37 GMT</pubDate>
      <guid>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-How-do-I-determine-which/m-p/303340#M36140</guid>
      <dc:creator>aljohnson_splun</dc:creator>
      <dc:date>2017-08-30T16:55:37Z</dc:date>
    </item>
  </channel>
</rss>

