Hi Splunk Experts--
A colleague of mine and I are exploring the Splunk Machine Learning
Toolkit and, more specifically, using the ML Toolkit to perform
Logistic Regression analysis on a dataset that includes categorical
data as independent variables.
When performing LR on categorical data, we've been taught the
statistical technique of creating "dummy variables" that, in effect,
transform the categorical data into a series of numeric variables.
Example: imagine a single categorical attribute color with values
["red", "yellow", "blue"]. That categorical data could be
transformed into three dummy variables (say, is_red, is_yellow and
is_blue), where each dummy variable takes the value 0 or 1.
Given this, a data record where color had value yellow would be
transformed into
is_red: 0
is_yellow: 1
is_blue: 0
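For illustration, here's a minimal sketch (not the toolkit's code) of that encoding using pandas directly; the "is" prefix is just to match the names above:

```python
import pandas as pd

# Toy data with a single categorical column.
df = pd.DataFrame({"color": ["red", "yellow", "blue", "yellow"]})

# One dummy (indicator) column per distinct value; cast to int so the
# 0/1 values are explicit on newer pandas (which returns booleans).
dummies = pd.get_dummies(df["color"], prefix="is").astype(int)
print(dummies)
```

Row 1 ("yellow") comes out exactly as in the example above: is_red=0, is_yellow=1, is_blue=0.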
We were both taught that when using dummy variables in Logistic
Regression, you should omit one dummy variable from the set
representing a given categorical variable. Dropping one (say, is_blue)
avoids double-counting that omitted value: zeros in all the remaining
dummy variables already encode it, so the full set is perfectly
collinear (the "dummy variable trap").
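A quick sketch (ours, not the toolkit's) of why one dummy is redundant: across the full dummy set every row sums to exactly 1, so any one column is determined by the others.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "yellow", "blue", "yellow"]})
dummies = pd.get_dummies(df["color"], prefix="is").astype(int)

# Each row sums to 1 across the full dummy set -- perfect collinearity.
assert (dummies.sum(axis=1) == 1).all()

# Dropping one column (say is_blue) removes the redundancy; all zeros
# in the remaining columns now encodes "blue".
reduced = dummies.drop("is_blue", axis=1)
print(list(reduced.columns))
```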
We've been crawling through the ML Toolkit (Logistic Regression)
source code to see how it handles categorical data, and we found
something that surprised both of us: the prepare_features method in
df_util.py (shown below) creates dummy variables for categorical data
by invoking the pandas get_dummies method (see the pd.get_dummies
call below).
def prepare_features(X, variables, final_columns=None, get_dummies=True):
    """Prepare features.

    This method defines conventional steps to prepare features:
        - drop unused columns
        - drop rows that have missing values
        - optionally (if get_dummies==True)
            - convert categorical fields into indicator dummy variables
        - optionally (if final_columns is provided)
            - make the resulting dataframe match final_columns

    Args:
        X (dataframe): input dataframe
        variables (list): column names
        final_columns (list): finalized column names
        get_dummies (bool): indicate if categorical variables should be converted

    Returns:
        X (dataframe): prepared feature dataframe
        nans (np array): boolean array to indicate which rows have missing
            values in the original dataframe
        columns (list): sorted list of feature column names
    """
    X, nans = drop_unused_and_missing(X, variables)
    if get_dummies:
        filter_non_numeric(X)
        X = pd.get_dummies(X, prefix_sep='=', sparse=True)
    if final_columns is not None:
        drop_unused_fields(X, final_columns)
        assert_any_fields(X)
        fill_missing_fields(X, final_columns)
    assert_any_rows(X)
    assert_any_fields(X)
    columns = sort_fields(X)
    return (X, nans, columns)
The ML Toolkit appears to bundle pandas 0.17. Starting with pandas
0.18, get_dummies supports a drop_first parameter that omits the
first dummy variable of each categorical variable, but that parameter
doesn't exist in pandas 0.17. To us this means the Splunk ML Toolkit
code itself would need to drop one of the dummy variables returned
by pandas-- and we don't see any code that does this.
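On a modern pandas (0.18 or later) the behavior we'd expect looks like this sketch; prefix_sep="=" mirrors the toolkit's own get_dummies call:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "yellow", "blue"]})

# drop_first omits the first level of each categorical variable,
# so one dummy serves as the implicit reference category.
dummies = pd.get_dummies(df, prefix_sep="=", drop_first=True)
print(list(dummies.columns))  # "color=blue" (the first level) is omitted
```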
So (finally!) here are our questions:
1. Are the assertions/interpretations above correct?
2. If so, does it follow that the ML Toolkit is not handling
categorical data correctly-- that it will produce biased models when
the input contains categorical data?
3. And if so, is there a technique for using the ML Toolkit to
perform Logistic Regression on categorical data that allows creating
models without this bias?
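One workaround we've been considering (our own sketch, not anything the toolkit provides; the helper name drop_reference_levels is ours) is to post-process the dummy-encoded frame and drop the first dummy of each categorical variable, which reproduces drop_first on any pandas version:

```python
import pandas as pd

def drop_reference_levels(X, prefix_sep="="):
    """Drop the first dummy column of each categorical variable,
    emulating get_dummies(drop_first=True) on older pandas."""
    seen, drops = set(), []
    for col in sorted(X.columns):
        if prefix_sep in col:  # dummy columns look like "color=red"
            prefix = col.split(prefix_sep, 1)[0]
            if prefix not in seen:
                seen.add(prefix)      # first level of this variable...
                drops.append(col)     # ...becomes the reference category
    return X.drop(drops, axis=1)

df = pd.DataFrame({"color": ["red", "yellow", "blue"], "x": [1.0, 2.0, 3.0]})
X = drop_reference_levels(pd.get_dummies(df, prefix_sep="="))
print(list(X.columns))  # "color=blue" dropped; numeric "x" untouched
```

Whether something equivalent can be wired into the toolkit's own fit pipeline is exactly what we're asking about.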
Hope this is reasonably clear-- thanks in advance for any advice!