Can I fit independent slopes and intercepts in a single Linear Regression fit statement by levels of a categorical variable?

nrohbock
Explorer

I have a predicament that keeps recurring. I have a large dataset with a categorical variable. I want to fit a regression and write the model's predicted values to a single column. Currently, I do this by iterating over the levels of the categorical variable: for each level I subset the data, fit the model, and map the results back to the output column:

| inputlookup test_generic.csv 
| stats values(x1) as x1 
| mvexpand x1 
| map search="inputlookup test_generic.csv | search x1=$x1$ | fit LinearRegression response from x2"

I would attach the data I prepared for this question, but I don't have the karma. My question is this:

Q: Is there a way to do this through how the | fit LinearRegression ... statement itself is specified?

I have to think there's a better way.

If it helps, this would be fit in R as:

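# Read the lookup data; x1 is categorical, x2 is numeric
dat <- read.csv("test_generic.csv", header = TRUE)
# -1 drops the global intercept, so each level of x1 gets its own intercept; x1*x2 lets the slope on x2 differ by level
mod <- lm(response ~ -1 + x1*x2, data = dat)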

It could also be fit in python as:

import pandas
import statsmodels.formula.api as smf

# Same formula as the R call: no global intercept, x1 interacted with x2
dat = pandas.read_csv('test_generic.csv')
mod = smf.ols(formula="response ~ -1 + x1*x2", data=dat).fit()

Thanks in advance!

PS: Here's some data for the test_generic.csv lookup:

"response","x1","x2"
3084,"Alt-Control",221
5623,"Alt-Control",237.8
4957,"Alt-Control",381.5
4019,"Alt-Control",196.8
3283,"Alt-Control",356.45
7365,"Clinical",381.5
3099,"Clinical",483.9
6144,"Clinical",162.6
5499,"Clinical",277.06
3211,"Clinical",422.1
8448,"Control",319.2
14243,"Control",242.5
15917,"Control",229.6
11399,"Control",335.5
6960,"Control",196.9
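
For anyone without R or statsmodels at hand, here is a minimal sketch in plain Python of what the formulas above amount to on this sample data: one intercept and one slope per level of x1, with every prediction written to a single list (the "output column"). The helper names fit_line, fit_by_level and predict are made up for this illustration and are not part of Splunk or the MLTK.

```python
from collections import defaultdict

# Sample rows copied from the test_generic.csv listing above: (response, x1, x2).
ROWS = [
    (3084, "Alt-Control", 221.0),
    (5623, "Alt-Control", 237.8),
    (4957, "Alt-Control", 381.5),
    (4019, "Alt-Control", 196.8),
    (3283, "Alt-Control", 356.45),
    (7365, "Clinical", 381.5),
    (3099, "Clinical", 483.9),
    (6144, "Clinical", 162.6),
    (5499, "Clinical", 277.06),
    (3211, "Clinical", 422.1),
    (8448, "Control", 319.2),
    (14243, "Control", 242.5),
    (15917, "Control", 229.6),
    (11399, "Control", 335.5),
    (6960, "Control", 196.9),
]


def fit_line(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_by_level(rows):
    """Return {level: (intercept, slope)}, fitting each level of x1 separately."""
    groups = defaultdict(list)
    for response, level, x2 in rows:
        groups[level].append((x2, response))
    return {
        level: fit_line([x for x, _ in pts], [y for _, y in pts])
        for level, pts in groups.items()
    }


def predict(rows, coefs):
    """One predicted value per input row, using that row's level coefficients."""
    return [coefs[level][0] + coefs[level][1] * x2 for _, level, x2 in rows]


coefs = fit_by_level(ROWS)
predicted = predict(ROWS, coefs)

for (response, level, x2), yhat in zip(ROWS, predicted):
    print(level, x2, response, round(yhat, 1))
```

The fitted values should match those from the interaction model, since a model with a free intercept and slope for every level separates into one independent least-squares problem per level.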

astein_splunk
Splunk Employee

Hi

Not 100% sure I follow you.

First, the | fit command applies one-hot encoding to categorical variables, so you can include them in an ordinary (continuous) regression if you want.

http://docs.splunk.com/Documentation/MLApp/3.3.0/User/Understandfitandapply

An example in action can be found on the Splunk blogs:
https://www.splunk.com/blog/2017/08/28/itsi-and-sophisticated-machine-learning.html
Take a look at the this_data_hour variable (a numeric value for the hour of day, made categorical by appending a character and then used in a regression).
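
To make the encoding point concrete, here is a rough sketch in plain Python (not MLTK code) of a one-hot design matrix built from the sample data in the question. One assumption goes beyond what is stated above: indicator columns on their own give each level its own intercept but a single shared slope on x2, so the sketch also adds x2-times-indicator columns, which is what the x1*x2 term in the R formula expands to. The names design_row and solve are invented for the illustration, and whether such derived columns are practical to feed to | fit is not confirmed in this thread.

```python
# Sample rows copied from the question: (response, x1, x2).
ROWS = [
    (3084, "Alt-Control", 221.0),
    (5623, "Alt-Control", 237.8),
    (4957, "Alt-Control", 381.5),
    (4019, "Alt-Control", 196.8),
    (3283, "Alt-Control", 356.45),
    (7365, "Clinical", 381.5),
    (3099, "Clinical", 483.9),
    (6144, "Clinical", 162.6),
    (5499, "Clinical", 277.06),
    (3211, "Clinical", 422.1),
    (8448, "Control", 319.2),
    (14243, "Control", 242.5),
    (15917, "Control", 229.6),
    (11399, "Control", 335.5),
    (6960, "Control", 196.9),
]
LEVELS = sorted({level for _, level, _ in ROWS})


def design_row(level, x2):
    """One-hot indicators for x1, followed by x2 times each indicator."""
    onehot = [1.0 if level == name else 0.0 for name in LEVELS]
    return onehot + [x2 * flag for flag in onehot]


def solve(a, b):
    """Solve the square system a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


X = [design_row(level, x2) for _, level, x2 in ROWS]
y = [float(response) for response, _, _ in ROWS]
k = len(X[0])

# Normal equations: (X^T X) beta = X^T y
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * target for row, target in zip(X, y)) for i in range(k)]
beta = solve(xtx, xty)

intercepts = dict(zip(LEVELS, beta[:len(LEVELS)]))
slopes = dict(zip(LEVELS, beta[len(LEVELS):]))
predicted = [sum(b * v for b, v in zip(beta, row)) for row in X]

for name in LEVELS:
    print(name, "intercept:", round(intercepts[name], 2), "slope:", round(slopes[name], 3))
```

This is a single regression over all rows, yet it yields a separate intercept and slope per level, because the columns belonging to different levels never overlap.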

If your intention is to fit a separate regression for each level of the categorical variable, like a “by” clause in the | stats command, then that isn’t supported, and you would have to continue using the map command as you are currently doing.

If you truly want the statsmodels OLS, you can follow the steps in http://docs.splunk.com/Documentation/MLApp/3.3.0/API/Introduction to import anything you want into your local copy of the MLTK. Check in after .Conf 2018 for more information on that.

nrohbock
Explorer

Sorry for the delayed response. After reading through every part of your response and the links in it, I am still using the 'map' solution. I think you may be right that the solution will ultimately be to import the actual statsmodels OLS using the method outlined in your last link.
