I have a predicament that keeps recurring. I have a large dataset with a categorical variable. I want to fit a regression and write the model's predicted values to a single column. Currently, I do this by iterating over the levels of the categorical variable, subsetting on each one, fitting the model, then mapping the results back to the output column:
| inputlookup test_generic.csv
| stats values(x1) as x1
| mvexpand x1
| map search="inputlookup test_generic.csv | search x1=$x1$ | fit LinearRegression response from x2"
I would attach the data I prepared for this question, but I don't have the karma. My question is this:
Q: Is there a way to do this through how the | fit LinearRegression ... command is specified?
I have to think there's a better way.
If it helps, this would be fit in R as:
dat <- read.csv("test_generic.csv",header=T)
mod <- lm(response ~ -1 + x1*x2, data=dat)
It could also be fit in python as:
import pandas
import statsmodels.formula.api as sm
dat = pandas.read_csv('test_generic.csv')
mod = sm.ols(formula="response ~ -1 + x1*x2", data=dat).fit()
Thanks in advance!
PS: Here's some data for the test_generic.csv lookup:
"response","x1","x2"
3084,"Alt-Control",221
5623,"Alt-Control",237.8
4957,"Alt-Control",381.5
4019,"Alt-Control",196.8
3283,"Alt-Control",356.45
7365,"Clinical",381.5
3099,"Clinical",483.9
6144,"Clinical",162.6
5499,"Clinical",277.06
3211,"Clinical",422.1
8448,"Control",319.2
14243,"Control",242.5
15917,"Control",229.6
11399,"Control",335.5
6960,"Control",196.9
Hi
Not 100% sure I follow you.
First, the | fit command uses one-hot encoding for categorical variables, so you can use a categorical field as a predictor in a regression if you want.
http://docs.splunk.com/Documentation/MLApp/3.3.0/User/Understandfitandapply
An example in action can be found on the Splunk blogs:
https://www.splunk.com/blog/2017/08/28/itsi-and-sophisticated-machine-learning.html
Take a look at the this_data_hour variable (a numeric value for the hour of day, made categorical by appending a character and then used in a regression).
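To illustrate what one-hot encoding of a categorical field means in general, here is a minimal plain-Python sketch. It is not the MLTK's implementation (which may, for example, drop a reference level or name the columns differently); the helper name one_hot is hypothetical and the level names are taken from the x1 column in the question.

```python
def one_hot(values):
    """Expand a list of category labels into 0/1 indicator rows.

    Hypothetical helper for illustration only; the MLTK's own
    encoding may differ in column order or in dropping a level.
    """
    levels = sorted(set(values))  # one indicator column per distinct level
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Level names borrowed from the x1 column of test_generic.csv
levels, encoded = one_hot(["Alt-Control", "Clinical", "Control", "Clinical"])
for label, row in zip(["Alt-Control", "Clinical", "Control", "Clinical"], encoded):
    print(label, row)
```

Each row has exactly one 1, in the column for its level. A regression on these indicator columns alone gives each level its own intercept but not its own slope, which is why one-hot encoding by itself does not reproduce a separate fit per level.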
If your intention is to have a separate regression for each level of the categorical variable, like a “by” clause in the | stats command for example, then that isn’t supported and you would have to continue to use the map command as you are currently doing.
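For reference, what a per-level fit computes can be sketched outside Splunk in plain Python. This is only an illustration under assumptions: the helper names fit_line and predict_by_level are hypothetical, the rows are copied from the test_generic.csv sample in the question, and a closed-form simple regression stands in for whatever the fit command does internally. Fitting one line per level gives the same fitted values as the fully interacted model response ~ -1 + x1*x2 from the question.

```python
from collections import defaultdict

# Rows copied from the test_generic.csv sample: (response, x1, x2)
ROWS = [
    (3084, "Alt-Control", 221), (5623, "Alt-Control", 237.8),
    (4957, "Alt-Control", 381.5), (4019, "Alt-Control", 196.8),
    (3283, "Alt-Control", 356.45),
    (7365, "Clinical", 381.5), (3099, "Clinical", 483.9),
    (6144, "Clinical", 162.6), (5499, "Clinical", 277.06),
    (3211, "Clinical", 422.1),
    (8448, "Control", 319.2), (14243, "Control", 242.5),
    (15917, "Control", 229.6), (11399, "Control", 335.5),
    (6960, "Control", 196.9),
]


def fit_line(xs, ys):
    """Closed-form simple OLS; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_by_level(rows):
    """Fit one line per level of x1; return predictions in input row order."""
    groups = defaultdict(list)
    for response, level, x2 in rows:
        groups[level].append((x2, response))
    coefs = {
        level: fit_line([x for x, _ in pts], [y for _, y in pts])
        for level, pts in groups.items()
    }
    # A single output column: each row uses the coefficients of its own level
    return [coefs[level][0] + coefs[level][1] * x2 for _, level, x2 in rows]


predicted = predict_by_level(ROWS)
for (response, level, x2), p in zip(ROWS, predicted):
    print(f"{level:12s} x2={x2:7.2f} response={response:6d} predicted={p:10.2f}")
```

The predictions come back as one list aligned with the input rows, which mirrors the single output column the map-based search produces.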
If you truly want the statsmodels OLS, you can follow the steps in http://docs.splunk.com/Documentation/MLApp/3.3.0/API/Introduction to import anything you want into your local copy of the MLTK. Check in after .Conf 2018 for more information on that.
Sorry for the delayed response. After reading through all the parts and links in your response, I am still using the 'map' solution. I think you may be right that ultimately the solution will be to import the actual statsmodels OLS using the method outlined in your last link.