Can I fit independent slopes and intercepts in a single Linear Regression fit statement by levels of a categorical variable?

nrohbock
Explorer

I have a problem that keeps recurring. I have a large dataset with a categorical variable. I want to fit a regression and write the model's predicted values to a single output column. Currently, I do this by iterating over the levels of the categorical variable: I subset the data on each level, fit the model, then map the results back to the output column:

| inputlookup test_generic.csv 
| stats values(x1) as x1 
| mvexpand x1 
| map search="inputlookup test_generic.csv | search x1=$x1$ | fit LinearRegression response from x2"

I would attach the data I prepared for this question, but I don't have the karma. My question is this:

Q: Is there a way to do this through how the | fit LinearRegression ... statement itself is specified?

I have to think there's a better way.

If it helps, this would be fit in R as:

# x1 is categorical; -1 drops the global intercept so each level gets its own
dat <- read.csv("test_generic.csv", header = TRUE)
mod <- lm(response ~ -1 + x1*x2, data = dat)

It could also be fit in python as:

import pandas
import statsmodels.formula.api as smf

dat = pandas.read_csv('test_generic.csv')
# x1*x2 expands to x1 + x2 + x1:x2, so each level gets its own intercept and slope
mod = smf.ols(formula="response ~ -1 + x1*x2", data=dat).fit()
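
For anyone comparing outputs: with the global intercept removed, x1*x2 gives every level of x1 its own intercept and its own x2 slope. The fitted values should therefore match what you get from fitting a separate straight line per level, which is what the map search does. Below is a minimal pure-Python sketch of that per-level fit, using the sample rows from the PS. The helper names (fit_line, fit_by_level, predict) are made up for illustration and are not part of Splunk, R, or statsmodels.

```python
from collections import defaultdict

# Sample rows from test_generic.csv in the question: (response, x1, x2)
ROWS = [
    (3084, "Alt-Control", 221.0), (5623, "Alt-Control", 237.8),
    (4957, "Alt-Control", 381.5), (4019, "Alt-Control", 196.8),
    (3283, "Alt-Control", 356.45),
    (7365, "Clinical", 381.5), (3099, "Clinical", 483.9),
    (6144, "Clinical", 162.6), (5499, "Clinical", 277.06),
    (3211, "Clinical", 422.1),
    (8448, "Control", 319.2), (14243, "Control", 242.5),
    (15917, "Control", 229.6), (11399, "Control", 335.5),
    (6960, "Control", 196.9),
]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_by_level(rows):
    """Fit one line per level of x1; returns {level: (intercept, slope)}."""
    groups = defaultdict(list)
    for response, level, x2 in rows:
        groups[level].append((x2, response))
    return {
        level: fit_line([x for x, _ in pts], [y for _, y in pts])
        for level, pts in groups.items()
    }


def predict(rows, coefs):
    """Predicted response for every row, in the original row order."""
    return [coefs[level][0] + coefs[level][1] * x2 for _, level, x2 in rows]


coefs = fit_by_level(ROWS)
predicted = predict(ROWS, coefs)  # the single "predicted" output column
for (response, level, x2), p in zip(ROWS, predicted):
    print(f"{level:12s} x2={x2:7.2f} response={response:6d} predicted={p:10.2f}")
```

The predictions come back in the original row order, which is the "single column" I am after. It is only a cross-check on the numbers, not a replacement for | fit.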

Thanks in advance!

PS: Here's some data for the test_generic.csv lookup:

"response","x1","x2"
3084,"Alt-Control",221
5623,"Alt-Control",237.8
4957,"Alt-Control",381.5
4019,"Alt-Control",196.8
3283,"Alt-Control",356.45
7365,"Clinical",381.5
3099,"Clinical",483.9
6144,"Clinical",162.6
5499,"Clinical",277.06
3211,"Clinical",422.1
8448,"Control",319.2
14243,"Control",242.5
15917,"Control",229.6
11399,"Control",335.5
6960,"Control",196.9

astein_splunk
Splunk Employee
Hi

Not 100% sure I follow you.

First, the | fit command uses one-hot encoding for categorical variables, so you can fit a single regression that includes the categorical field as a predictor if you want.

http://docs.splunk.com/Documentation/MLApp/3.3.0/User/Understandfitandapply

An example in action can be found on the Splunk blog:
https://www.splunk.com/blog/2017/08/28/itsi-and-sophisticated-machine-learning.html
Take a look at the this_data_hour variable (a numeric value for the hour of day, made categorical by appending a character and then used in a regression).
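
To make the encoding point concrete, here is a small Python sketch of what one-hot encoding a categorical field looks like in general. It is not taken from the MLTK source and does not claim to show how | fit builds its columns internally. The names one_hot and design_row are hypothetical. The indicator columns alone give each level its own intercept but a shared x2 slope. Independent slopes additionally need each indicator multiplied by x2, which is what the x1:x2 interaction term does in the R and Python formulas.

```python
# Levels of x1 in the question's sample data
LEVELS = ["Alt-Control", "Clinical", "Control"]


def one_hot(level, levels=LEVELS):
    """One indicator column per level: 1.0 for the matching level, else 0.0."""
    return [1.0 if level == name else 0.0 for name in levels]


def design_row(level, x2, levels=LEVELS):
    """Indicators (per-level intercepts) followed by indicator * x2 (per-level slopes)."""
    indicators = one_hot(level, levels)
    interactions = [flag * x2 for flag in indicators]
    return indicators + interactions


row = design_row("Clinical", 381.5)
print(row)
```

A regression on all six columns with no global intercept has one intercept and one slope per level. That is the same model as fitting each level separately.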

If your intention is to fit a separate regression for each level of the categorical variable, like a "by" clause in the | stats command, then that isn't supported, and you would have to continue using the map command as you are currently doing.

If you truly want the statsmodels OLS, you can follow the steps in http://docs.splunk.com/Documentation/MLApp/3.3.0/API/Introduction to import anything you want into your local copy of the MLTK. Check in after .Conf 2018 for more information on that.

nrohbock
Explorer

Sorry for the delayed response. After reading through your whole response and all of the links, I am still using the 'map' solution. I think you may be right that the eventual solution will be to import the actual statsmodels OLS using the method outlined in your last link.
