How to create a column labeling rows based on cond...

johnsmith454 · ‎09-21-2020

I have tried using answers to similar questions on here, but I'm having a problem where I want to create a column of 4 labels. However, when I try to create these, the labels I make eat into re-labeling the first label I have assigned. For example, I am looking to create a label a column like this:

Gene   Feature1   Feature2   Feature3 ...  label
Gene1   1            3         1            most likely
Gene2   0            0         1            probable
Gene3   NA           NA        NA           unknown
Gene4   0            0         0            unlikely

However, my data is imported from big data analysis and so my features are not represented here, but the 4 labels are what I'm trying to get. I try to code this with:

df$label[(df$Mechanism == 1)|(df$med >= 3) |(df$OMIM == 1)] <- "most likely"

df$label[is.na(df$label) & (df$med <= 2 )|(df$SideeffectFreq>=1) |(df$MGI_Gene==1) |(df$model_Gene==1) |(df$Rank>=1) ] <- "probable"

df$label[(df$Causality == 'least likely')] <- "least likely"

df$label[is.na(df$label)] <- "unknown"

When I run the first line to create the "most likely" label, this labels 50 genes (which is what I expected), but running the second line for "probable" re-labels some of the "most likely" genes to only give 34 of them left. I thought using is.na(df$label) or (df$label != 'most likely') would resolve this, but neither do.

Is there a better way to go about creating a labels column like this? I am new to coding so also if anyone can explain why the is.na(df$label) or (df$label != 'most likely') do not work as I expected that would also be really helpful.

Edit: Example where 'most likely' label is taken up:

#Input data:
dput(dt)
structure(list(Gene = c("gene1", "gene2", "gene3", "gene4"), 
    F1 = c(1L, 0L, 0L, 1L), F2 = c(3L, 0L, 0L, 1L), F3 = c("1", 
    "1", "1", "least likely"), label = c(NA, NA, NA, 
    NA)), row.names = c(NA, -4L), class = c("data.table", 
"data.frame"))

dt$label[(dt$F1 == 1)|(dt$F2 >= 3) |(dt$F1 == 1)] <- "most likely"
dt$label[(dt$label != 'most likely') & (dt$F1 == 2)|(dt$F2 == 0) |(dt$F1 == 1)] <- "probable"
dt$label[(dt$F1 == 0)|(dt$F2 == 0)] <- "unlikely"
dt$label[(dt$F3 == 'least likely')] <- "unknown"

thambisetty · ‎09-21-2020

I think you are looking for eval case statement.

syntax:
| eval newfield=case(condtion1,value to assign to newfield if condition1 matches,condition2, value to assign to newfield if condtion2 matches,...,default,default value)

SPL:

| eval label=case(Feature1=="1" AND Feature2=="3" AND Feature="1","most likely",1=1,"Other")

————————————
If this helps, give a like below.

How to create a column labeling rows based on conditions?

introduction

Welcome to the Splunk Community!

Tech Talk | Elevating Digital Service Excellence: The Synergy of Splunk RUM & APM

Adoption of RUM and APM at Splunk