Thank you for asking! This is a valuable question, and you clearly have a good understanding of dummy variables.
First, about the bias in the model. Let's assume you have m = 3 dummy variables x1, x2, x3, such that x1 + x2 + x3 = 1.
With m-1 dummy variables, your linear model can be expressed as
y = α0 + α1 * x1 + α2 * x2
With m dummy variables, your linear model is now:
y = β0 + β1 * x1 + β2 * x2 + β3 * x3
Since x3 = 1 − x1 − x2, you get
y = β0 + β1 * x1 + β2 * x2 + β3 * (1 − x1 − x2) = (β0 + β3) + (β1 − β3) * x1 + (β2 − β3) * x2
Essentially you have
α0 = β0 + β3, α1 = β1 − β3, α2 = β2 − β3
So the two models are equivalent, and no bias is introduced, as this exercise shows.
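The equivalence is easy to verify numerically. Here is a minimal plain-Python sketch (the β values are arbitrary, purely for illustration):

```python
# Arbitrary coefficients for the m-dummy model (illustrative values)
b0, b1, b2, b3 = 0.5, 1.0, -2.0, 3.0

# Derived coefficients for the (m-1)-dummy model
a0 = b0 + b3
a1 = b1 - b3
a2 = b2 - b3

# The three possible one-hot encodings (x1, x2, x3)
for x1, x2, x3 in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]:
    y_m = b0 + b1 * x1 + b2 * x2 + b3 * x3  # m dummies
    y_m1 = a0 + a1 * x1 + a2 * x2           # m-1 dummies
    assert y_m == y_m1                       # identical predictions
```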
Now, what is introduced here? Collinearity: if you know m-1 of the dummy variables, you can always determine the value of the one left out. Collinearity causes computational problems for ordinary linear regression because the design matrix becomes rank-deficient, so the matrix inversion cannot be performed. For logistic regression, depending on the optimization scheme under the hood, e.g. gradient descent, numerical instability may not be an issue. Moreover, the LogisticRegression model in sklearn applies regularization by default (penalty='l2', C=1.0), which penalizes the large coefficients that collinear features would otherwise allow. Therefore, using the full m dummy variables instead of m-1 does not introduce bias to the model; the only concern is potential numerical instability.
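The collinearity is exact: with all m dummies present, the dummy columns sum to the intercept column, so the columns of the design matrix are linearly dependent. A quick plain-Python illustration (rows are hypothetical design-matrix rows of the form [intercept, x1, x2, x3]):

```python
# Each row: [intercept, x1, x2, x3] for one sample
rows = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]

# x1 + x2 + x3 reproduces the intercept column exactly,
# which is the linear dependence that makes X'X singular.
for intercept, x1, x2, x3 in rows:
    assert x1 + x2 + x3 == intercept
```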
In practice, if you decide to go with m-1 dummy variables to avoid the potential numerical-instability issue, you have the following options:
1) With the latest version of MLTK (you are right that it uses pandas 0.17), you can modify the prepare_features_and_target method in df_util.py. Instead of doing
X = pd.get_dummies(X, prefix_sep='=', sparse=True)
you can use the following code to drop the first column of the created dummy variables for each categorical variable:
columns_to_encode = X.select_dtypes(include=['object', 'category']).columns
for col in columns_to_encode:
    X = X.join(pd.get_dummies(X.pop(col), prefix=col, prefix_sep='=').iloc[:, 1:])
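For example, on a toy DataFrame (the column names and data are purely illustrative), the loop leaves the numeric column alone and drops the first dummy level of each categorical column:

```python
import pandas as pd

# Toy frame with one categorical column (illustrative data)
X = pd.DataFrame({'color': ['red', 'blue', 'green'], 'val': [1, 2, 3]})

columns_to_encode = X.select_dtypes(include=['object', 'category']).columns
for col in columns_to_encode:
    # pop removes the raw column; iloc[:, 1:] drops the first dummy
    # level (here 'color=blue'), keeping m-1 dummies per variable
    X = X.join(pd.get_dummies(X.pop(col), prefix=col, prefix_sep='=').iloc[:, 1:])

print(list(X.columns))  # ['val', 'color=green', 'color=red']
```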
2) As you already mentioned in your post, drop_first=True is supported in pandas 0.18+; you could use it once a future version of Python for Scientific Computing is released.
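With pandas 0.18+, the whole loop collapses to a single get_dummies call (a sketch on the same kind of toy frame; data is illustrative):

```python
import pandas as pd

X = pd.DataFrame({'color': ['red', 'blue', 'green'], 'val': [1, 2, 3]})

# drop_first=True drops the first level of each categorical variable,
# producing m-1 dummies per variable in one call
X = pd.get_dummies(X, prefix_sep='=', drop_first=True)

print(list(X.columns))  # ['val', 'color=green', 'color=red']
```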
On the other hand, if you want to reduce the effect of collinearity in your model, you can also use preprocessing methods, e.g. Field Selector to select features or PCA to remove collinearity. You can also use algorithms such as Random Forest that are less affected by feature multicollinearity.
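As a rough sketch of the PCA route (assuming scikit-learn is available; the synthetic dataset and the component count are illustrative, not part of MLTK): after centering, a full one-hot block has one zero-variance direction, which PCA discards before the classifier sees it.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
# 100 samples of one categorical variable with 3 levels,
# expanded into a full (collinear) one-hot encoding
cats = rng.randint(0, 3, size=100)
X = np.eye(3)[cats]            # columns sum to 1 in every row
y = (cats == 0).astype(int)

# PCA removes the redundant direction created by the collinearity,
# then logistic regression fits on the decorrelated components
model = make_pipeline(PCA(n_components=2), LogisticRegression())
model.fit(X, y)
print(model.score(X, y))
```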
Hope this helps clarify some of the issues.