<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Splunk Machine Learning Toolkit: Does ML Toolkit categorical data generate biased (Logistic Regression) models? in All Apps and Add-ons</title>
    <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345055#M41710</link>
    <description>&lt;P&gt;Hi yangzd! Thanks for an awesome (clear and thorough) answer. We really appreciate your taking the time it must have taken to write this up. It's not clear if (or to what extent) our specific use-cases will be subject to the stability problem you describe. After some conversation I believe we're going to clone the ML Toolkit and compare models created with the stock and modified version. Let me know if you're interested in hearing about the results. And thanks so much again for your help.  ..j&lt;/P&gt;</description>
    <pubDate>Wed, 20 Dec 2017 00:56:19 GMT</pubDate>
    <dc:creator>jsinnott_</dc:creator>
    <dc:date>2017-12-20T00:56:19Z</dc:date>
    <item>
      <title>Splunk Machine Learning Toolkit: Does ML Toolkit categorical data generate biased (Logistic Regression) models?</title>
      <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345051#M41706</link>
      <description>&lt;P&gt;Hi Splunk Experts--&lt;/P&gt;

&lt;P&gt;A colleague of mine and I are exploring the Splunk Machine Learning&lt;BR /&gt;
Toolkit and, more specifically, using the ML Toolkit to perform&lt;BR /&gt;
Logistic Regression analysis on a dataset that includes categorical&lt;BR /&gt;
data as independent variables.&lt;/P&gt;

&lt;P&gt;When performing LR on categorical data, we've been taught the&lt;BR /&gt;
statistical technique of creating "dummy variables" that, in effect,&lt;BR /&gt;
transform the categorical data into a series of numeric variables.&lt;/P&gt;

&lt;P&gt;Example: Imagine a single categorical attribute &lt;CODE&gt;color&lt;/CODE&gt; with values&lt;BR /&gt;
["red", "yellow" and "blue"]. That categorical data could be&lt;BR /&gt;
transformed into three dummy variables (say, &lt;CODE&gt;is_red&lt;/CODE&gt;, &lt;CODE&gt;is_yellow&lt;/CODE&gt; and&lt;BR /&gt;
&lt;CODE&gt;is_blue&lt;/CODE&gt;) where each dummy variable would have a value [&lt;CODE&gt;0&lt;/CODE&gt; or &lt;CODE&gt;1&lt;/CODE&gt;].&lt;/P&gt;

&lt;P&gt;Given this, a data record where &lt;CODE&gt;color&lt;/CODE&gt; had value &lt;CODE&gt;yellow&lt;/CODE&gt; would be&lt;BR /&gt;
transformed into&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;  is_red:    0
  is_yellow: 1
  is_blue:   0
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;We were both taught that when using dummy variables in Logistic&lt;BR /&gt;
Regression, you need to omit one dummy variable from the set&lt;BR /&gt;
representing a given categorical variable. Doing this prevents&lt;BR /&gt;
double-counting of that omitted categorical value (say, &lt;CODE&gt;is_blue&lt;/CODE&gt;)&lt;BR /&gt;
because having zeros in all other dummy variables effectively&lt;BR /&gt;
represents a one in the omitted dummy variable.&lt;/P&gt;
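
&lt;P&gt;To make the encoding concrete, here is a minimal plain-Python sketch of the idea described above. The helper name &lt;CODE&gt;dummy_encode&lt;/CODE&gt; and its &lt;CODE&gt;reference&lt;/CODE&gt; argument are hypothetical, made up for illustration; this is not how pandas or the ML Toolkit implement it.&lt;/P&gt;

```python
def dummy_encode(value, levels, reference=None):
    """Return a dict of 0/1 indicator variables for one categorical value.

    If `reference` is given, that level's indicator is omitted, so the
    reference level is represented by zeros in every remaining indicator.
    """
    if value not in levels:
        raise ValueError("unknown level: %r" % (value,))
    kept = [level for level in levels if level != reference]
    return {"is_" + level: int(value == level) for level in kept}


levels = ["red", "yellow", "blue"]

# Full encoding: three indicators, exactly one of which is 1.
full = dummy_encode("yellow", levels)

# Reduced encoding: "blue" is the omitted (reference) level.
reduced_yellow = dummy_encode("yellow", levels, reference="blue")
reduced_blue = dummy_encode("blue", levels, reference="blue")

print(full)            # {'is_red': 0, 'is_yellow': 1, 'is_blue': 0}
print(reduced_yellow)  # {'is_red': 0, 'is_yellow': 1}
print(reduced_blue)    # {'is_red': 0, 'is_yellow': 0}
```

&lt;P&gt;In the reduced encoding, a record whose remaining indicators are all zero can only be &lt;CODE&gt;blue&lt;/CODE&gt;, so no information is lost by omitting &lt;CODE&gt;is_blue&lt;/CODE&gt;.&lt;/P&gt;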

&lt;P&gt;We've been crawling through the ML Toolkit (Logistic Regression)&lt;BR /&gt;
source code to see how it handles categorical data and have found&lt;BR /&gt;
something that surprises both of us. Specifically, the&lt;BR /&gt;
&lt;CODE&gt;prepare_features&lt;/CODE&gt; method in &lt;CODE&gt;df_util.py&lt;/CODE&gt; (see below) uses pandas&lt;BR /&gt;
to create dummy variables for categorical data by invoking the pandas&lt;BR /&gt;
&lt;CODE&gt;get_dummies&lt;/CODE&gt; method (see line 27 below).&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;  def prepare_features(X, variables, final_columns=None, get_dummies=True):
      """Prepare features.

      This method defines conventional steps to prepare features:
          - drop unused columns
          - drop rows that have missing values
          - optionally (if get_dummies==True)
              - convert categorical fields into indicator dummy variables
          - optionally (if final_column is provided)
              - make the resulting dataframe match final_columns

      Args:
          X (dataframe): input dataframe
          variables (list): column names
          final_columns (list): finalized column names
          get_dummies (bool): indicate if categorical variable should be converted

      Returns:
          X (dataframe): prepared feature dataframe
          nans (np array): boolean array to indicate which rows have missing
              values in the original dataframe
          columns (list): sorted list of feature column names
      """
      X, nans = drop_unused_and_missing(X, variables)
      if get_dummies:
          filter_non_numeric(X)
          X = pd.get_dummies(X, prefix_sep='=', sparse=True)
      if final_columns is not None:
          drop_unused_fields(X, final_columns)
          assert_any_fields(X)
          fill_missing_fields(X, final_columns)
      assert_any_rows(X)
      assert_any_fields(X)
      columns = sort_fields(X)
      return (X, nans, columns)
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The ML toolkit seems to use pandas 0.17. In pandas 0.18 the&lt;BR /&gt;
&lt;CODE&gt;get_dummies&lt;/CODE&gt; method supports a &lt;CODE&gt;drop_first&lt;/CODE&gt; parameter which omits the&lt;BR /&gt;
first dummy variable for a categorical variable, but that's not&lt;BR /&gt;
available in pandas 0.17. To us this means that the Splunk ML Toolkit&lt;BR /&gt;
code should contain code to drop one of the dummy variables returned&lt;BR /&gt;
by pandas-- and we don't see code that does this.&lt;/P&gt;

&lt;P&gt;So (finally!) here are our questions:&lt;/P&gt;

&lt;UL&gt;
&lt;LI&gt;&lt;P&gt;Are the assertions/interpretations above correct?&lt;/P&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;P&gt;If so does it follow that the ML Toolkit is not handling categorical&lt;BR /&gt;
data correctly-- that it will produce biased models when the input&lt;BR /&gt;
contains categorical data?&lt;/P&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;P&gt;And if so, is there a technique for using the ML Toolkit to perform&lt;BR /&gt;
Logistic Regression on categorical data that allows creation of&lt;BR /&gt;
models without this bias?&lt;/P&gt;&lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;'Hope this is reasonably clear-- thanks in advance for any advice!&lt;/P&gt;</description>
      <pubDate>Wed, 13 Dec 2017 20:30:31 GMT</pubDate>
      <guid>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345051#M41706</guid>
      <dc:creator>jsinnott_</dc:creator>
      <dc:date>2017-12-13T20:30:31Z</dc:date>
    </item>
    <item>
      <title>Re: Splunk Machine Learning Toolkit: Does ML Toolkit categorical data generate biased (Logistic Regression) models?</title>
      <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345052#M41707</link>
      <description>&lt;P&gt;This question ignited some interesting discussion on the ML teams here; I'll try to nudge someone into answering. &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 16 Dec 2017 01:03:04 GMT</pubDate>
      <guid>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345052#M41707</guid>
      <dc:creator>acruise_splunk</dc:creator>
      <dc:date>2017-12-16T01:03:04Z</dc:date>
    </item>
    <item>
      <title>Re: Splunk Machine Learning Toolkit: Does ML Toolkit categorical data generate biased (Logistic Regression) models?</title>
      <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345053#M41708</link>
      <description>&lt;P&gt;Hi! Thanks so much for your reply-- my colleagues and I eagerly await your thoughts. Glad to provide more info/context/etc. if that'd be helpful.  ..j&lt;/P&gt;</description>
      <pubDate>Mon, 18 Dec 2017 03:39:05 GMT</pubDate>
      <guid>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345053#M41708</guid>
      <dc:creator>jsinnott_</dc:creator>
      <dc:date>2017-12-18T03:39:05Z</dc:date>
    </item>
    <item>
      <title>Re: Splunk Machine Learning Toolkit: Does ML Toolkit categorical data generate biased (Logistic Regression) models?</title>
      <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345054#M41709</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;Thank you for asking; this is an incredibly valuable question! You have a very good understanding of dummy variables.&lt;/P&gt;

&lt;P&gt;First, about the bias in the model. Let's assume you have &lt;CODE&gt;m = 3&lt;/CODE&gt; dummy variables &lt;CODE&gt;x1&lt;/CODE&gt;, &lt;CODE&gt;x2&lt;/CODE&gt;, &lt;CODE&gt;x3&lt;/CODE&gt;, such that &lt;CODE&gt;x1 + x2 + x3 = 1&lt;/CODE&gt;.&lt;BR /&gt;
With &lt;CODE&gt;m-1&lt;/CODE&gt; dummy variables, your linear model can be expressed as&lt;BR /&gt;
    &lt;CODE&gt;y = α0 + α1 * x1 + α2 * x2&lt;/CODE&gt;&lt;BR /&gt;
With &lt;CODE&gt;m&lt;/CODE&gt; dummy variables, your linear model is now:&lt;BR /&gt;
    &lt;CODE&gt;y = β0 + β1 * x1 + β2 * x2 + β3 * x3&lt;/CODE&gt;&lt;BR /&gt;
Since &lt;CODE&gt;x3 = 1 − x1 − x2&lt;/CODE&gt;, you get&lt;BR /&gt;
    &lt;CODE&gt;y = β0 + β1 * x1 + β2 * x2 + β3 * (1 − x1 − x2) = (β0 + β3) + (β1 − β3) * x1 + (β2 − β3) * x2&lt;/CODE&gt;&lt;BR /&gt;
Essentially you have&lt;BR /&gt;
    &lt;CODE&gt;α0 = β0 + β3&lt;/CODE&gt;, &lt;CODE&gt;α1 = β1 − β3&lt;/CODE&gt;, &lt;CODE&gt;α2 = β2 − β3&lt;/CODE&gt;&lt;/P&gt;

&lt;P&gt;So these two models are equivalent: any prediction made with &lt;CODE&gt;m&lt;/CODE&gt; dummy variables can be reproduced exactly with &lt;CODE&gt;m-1&lt;/CODE&gt;, and no bias is introduced.&lt;/P&gt;
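
&lt;P&gt;If it helps, the identities above can be checked numerically. The sketch below uses arbitrary, made-up coefficient values (they are not from any fitted model) and evaluates both forms of the linear model on the three possible one-hot rows.&lt;/P&gt;

```python
import math

# Arbitrary illustrative coefficients for the model with all three dummies.
b0, b1, b2, b3 = 0.5, 1.2, -0.7, 0.3

# Coefficients of the reduced model, from the identities above.
a0, a1, a2 = b0 + b3, b1 - b3, b2 - b3


def full_model(x1, x2, x3):
    """Linear model using all m = 3 dummy variables."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x3


def reduced_model(x1, x2):
    """Linear model using only m - 1 = 2 dummy variables."""
    return a0 + a1 * x1 + a2 * x2


# The only valid inputs are the three one-hot rows (x1 + x2 + x3 = 1).
rows = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
pairs = [(full_model(x1, x2, x3), reduced_model(x1, x2)) for x1, x2, x3 in rows]
for full, reduced in pairs:
    print(round(full, 6), round(reduced, 6))

all_equal = all(math.isclose(f, r) for f, r in pairs)
print(all_equal)  # True
```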

&lt;P&gt;Now, the question is, what is introduced here? The answer is collinearity, since you can always tell the value of the left-out dummy variable if you know the other &lt;CODE&gt;m-1&lt;/CODE&gt; of them. Collinearity can cause computational problems for ordinary linear regression, because the matrix inversion in the closed-form solution cannot be performed. For logistic regression, depending on the computational scheme under the hood (e.g. gradient descent), numerical instability may not be an issue. Moreover, the LogisticRegression model in sklearn applies regularization by default (&lt;CODE&gt;penalty='l2'&lt;/CODE&gt; and &lt;CODE&gt;C=1.0&lt;/CODE&gt;), which penalizes large coefficients and so keeps the fit well-behaved even when features are collinear. Therefore, using the full &lt;CODE&gt;m&lt;/CODE&gt; dummy variables instead of &lt;CODE&gt;m-1&lt;/CODE&gt; does not introduce bias into the model; the only risk is potential numerical instability.&lt;/P&gt;
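
&lt;P&gt;One way to see what collinearity means here: with all &lt;CODE&gt;m&lt;/CODE&gt; dummy variables plus an intercept, many different coefficient sets give identical predictions. The sketch below is plain Python with made-up coefficients and hypothetical helper names (&lt;CODE&gt;predict&lt;/CODE&gt;, &lt;CODE&gt;shift&lt;/CODE&gt;, &lt;CODE&gt;squared_norm&lt;/CODE&gt;); it is not sklearn's implementation. It shows that shifting the coefficients leaves the predictions unchanged while changing their squared norm, which is roughly why an L2 penalty can single out one of the equivalent sets.&lt;/P&gt;

```python
import math

rows = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]  # one-hot rows, x1 + x2 + x3 = 1


def predict(coefs, row):
    """Linear predictor: intercept plus the weighted dummy variables."""
    intercept, weights = coefs[0], coefs[1:]
    return intercept + sum(w * x for w, x in zip(weights, row))


def shift(coefs, c):
    """Add c to the intercept and subtract c from every dummy coefficient."""
    return [coefs[0] + c] + [w - c for w in coefs[1:]]


def squared_norm(coefs):
    """Sum of squared dummy coefficients (the quantity an L2 penalty grows with)."""
    return sum(w * w for w in coefs[1:])


original = [0.5, 1.2, -0.7, 0.3]  # hypothetical b0, b1, b2, b3
shifted = shift(original, 2.0)

# Both coefficient sets produce the same prediction for every valid row.
same_predictions = all(
    math.isclose(predict(original, row), predict(shifted, row)) for row in rows
)
print(same_predictions)  # True

# The squared norms differ, so a penalty on them prefers one set over the other.
print(squared_norm(original), squared_norm(shifted))
```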

&lt;P&gt;In practice, to avoid the potential numerical instability issue, if you decide to go with &lt;CODE&gt;m-1&lt;/CODE&gt; dummy variables, you may have the following options:&lt;/P&gt;

&lt;P&gt;1) With the latest version of MLTK (you are right, it uses pandas 0.17), you can modify the &lt;CODE&gt;prepare_features_and_target&lt;/CODE&gt; method in &lt;CODE&gt;df_util.py&lt;/CODE&gt;. Instead of&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;    X = pd.get_dummies(X, prefix_sep='=', sparse=True)
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;you can use the following code to drop the first column of the created dummy variables for each categorical variable:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;    columns_to_encode = X.select_dtypes(include=['object', 'category']).columns
    for col in columns_to_encode:
        X = X.join(pd.get_dummies(X.pop(col), prefix=col, prefix_sep='=').iloc[:, 1:])
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;2) As you already mentioned in your post, &lt;CODE&gt;drop_first=True&lt;/CODE&gt; is supported in pandas 0.18+, so you could use this when a future version of Python for Scientific Computing is released.&lt;/P&gt;

&lt;P&gt;On the other hand, if you want to reduce the effect of collinearity in your model, you can also use some preprocessing methods, e.g. Field Selector to select features, or PCA to remove collinearity. You can also use algorithms like Random Forest that are less affected by feature multicollinearity.&lt;/P&gt;

&lt;P&gt;Hope it helps clarify some of the issues.&lt;/P&gt;

&lt;P&gt;zd&lt;/P&gt;</description>
      <pubDate>Tue, 19 Dec 2017 07:32:28 GMT</pubDate>
      <guid>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345054#M41709</guid>
      <dc:creator>yangzd</dc:creator>
      <dc:date>2017-12-19T07:32:28Z</dc:date>
    </item>
    <item>
      <title>Re: Splunk Machine Learning Toolkit: Does ML Toolkit categorical data generate biased (Logistic Regression) models?</title>
      <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345055#M41710</link>
      <description>&lt;P&gt;Hi yangzd! Thanks for an awesome (clear and thorough) answer. We really appreciate your taking the time it must have taken to write this up. It's not clear if (or to what extent) our specific use-cases will be subject to the stability problem you describe. After some conversation I believe we're going to clone the ML Toolkit and compare models created with the stock and modified version. Let me know if you're interested in hearing about the results. And thanks so much again for your help.  ..j&lt;/P&gt;</description>
      <pubDate>Wed, 20 Dec 2017 00:56:19 GMT</pubDate>
      <guid>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345055#M41710</guid>
      <dc:creator>jsinnott_</dc:creator>
      <dc:date>2017-12-20T00:56:19Z</dc:date>
    </item>
    <item>
      <title>Re: Splunk Machine Learning Toolkit: Does ML Toolkit categorical data generate biased (Logistic Regression) models?</title>
      <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345056#M41711</link>
      <description>&lt;P&gt;Hi jsinnott_, that sounds great. Yes please do let us know the comparison results. Look forward to it.  -zd&lt;/P&gt;</description>
      <pubDate>Wed, 20 Dec 2017 18:30:26 GMT</pubDate>
      <guid>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345056#M41711</guid>
      <dc:creator>yangzd</dc:creator>
      <dc:date>2017-12-20T18:30:26Z</dc:date>
    </item>
    <item>
      <title>Re: Splunk Machine Learning Toolkit: Does ML Toolkit categorical data generate biased (Logistic Regression) models?</title>
      <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345057#M41712</link>
      <description>&lt;P&gt;In addition to @yangzd's response below, you can do your own categorical encoding quite simply with eval. Say, for example, you have a field &lt;CODE&gt;color&lt;/CODE&gt; with values &lt;CODE&gt;is_red&lt;/CODE&gt;, &lt;CODE&gt;is_yellow&lt;/CODE&gt;, and &lt;CODE&gt;is_blue&lt;/CODE&gt;, and you'd like to encode the 3 levels into two dummy variables (treating &lt;CODE&gt;is_blue&lt;/CODE&gt; as the base):&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| eval {color} = 1
| fillnull is_red is_yellow
| fields - is_blue
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The &lt;CODE&gt;{color}&lt;/CODE&gt; on the left side of eval will take the value of the field and use it as the name of the new field. &lt;/P&gt;</description>
      <pubDate>Thu, 21 Dec 2017 05:07:51 GMT</pubDate>
      <guid>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345057#M41712</guid>
      <dc:creator>aljohnson_splun</dc:creator>
      <dc:date>2017-12-21T05:07:51Z</dc:date>
    </item>
    <item>
      <title>Re: Splunk Machine Learning Toolkit: Does ML Toolkit categorical data generate biased (Logistic Regression) models?</title>
      <link>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345058#M41713</link>
      <description>&lt;P&gt;Hi aljohnson-- Thanks very much for this. This, it turns out, is the method we're using to do the comparison between letting the ML Toolkit handle categorical data (described above) and converting our categorical data to dummy variables prior to invoking the &lt;CODE&gt;fit&lt;/CODE&gt; command. In fact, we generalize this to something like (for a categorical attribute "foo"):&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;...
| eval foo_is_{foo} = 1
...
| foreach foo_is_* [ eval &amp;lt;&amp;lt;FIELD&amp;gt;&amp;gt;=coalesce(&amp;lt;&amp;lt;FIELD&amp;gt;&amp;gt;,0) ]
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;Thanks for taking the time to comment!&lt;/P&gt;</description>
      <pubDate>Tue, 02 Jan 2018 17:14:10 GMT</pubDate>
      <guid>https://community.splunk.com/t5/All-Apps-and-Add-ons/Splunk-Machine-Learning-Toolkit-Does-ML-Toolkit-categorical-data/m-p/345058#M41713</guid>
      <dc:creator>jsinnott_</dc:creator>
      <dc:date>2018-01-02T17:14:10Z</dc:date>
    </item>
  </channel>
</rss>

