Label Encoding in Machine Learning Toolkit

wbw4am
New Member

I believe that by default the Machine Learning Toolkit uses one-hot encoding when converting categorical variables to numerical ones. Is there an easy way to use label encoding instead? For example, I want to assign a risk score based on country, so China might map to 5 and the US to 1, where 5 is riskier than 1.

I imagine I could do this with a bunch of eval commands in the query, or alternatively with an additional field extraction, but is there a "prettier" way to do this?

aoliner_splunk
Splunk Employee

One option, if the scores are relatively static, is to use a lookup. Another option, if you've already calculated all the risk_score values and want to keep them "up to date" as conditions change, is to fit a regression model:

... | fit LinearRegression risk_score from factor_A factor_B ... into my_risk_model

You could use whatever regression algorithm you want and whatever factors you want. Then, when it's time to score:

... | apply my_risk_model
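
For the static case, the lookup option mentioned above might look roughly like the sketch below. It assumes a CSV lookup file, hypothetically named country_risk.csv, with the columns country and risk_score, and it assumes your events carry a field called country. The fillnull default of 0 for countries missing from the table is also just an assumption; pick whatever default suits your scoring.

```spl
... | lookup country_risk.csv country OUTPUT risk_score
    | fillnull value=0 risk_score
```

After this, risk_score is an ordinary numeric field, so it can be passed to fit like any other numeric feature rather than being one-hot encoded as a categorical one.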

aljohnson_splun
Splunk Employee

Have you considered using a lookup?
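
If you go that route, the lookup table has to exist first. Here is a minimal sketch of one way to build a small one straight from a search, using the two example scores from the question. The file name country_risk.csv and the field names country and risk_score are placeholders, not anything the toolkit requires.

```spl
| makeresults
| eval raw="China,5;US,1"
| makemv delim=";" raw
| mvexpand raw
| eval country=mvindex(split(raw, ","), 0), risk_score=tonumber(mvindex(split(raw, ","), 1))
| table country risk_score
| outputlookup country_risk.csv
```

For more than a handful of countries it is probably easier to upload a CSV through the lookup settings instead of typing the pairs into a search.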
