Splunk Search

How to obtain a valid value of a field among known bad ones

yuanliu
SplunkTrust

I have some data with flip-flop values, akin to the following simulation:

| makeresults count=20
| eval id = "id" . (random() % 2 + 1), bad = case(random() % 3 == 0, "real " . id, random() % 5 == 0, "", true(), "fake")

This inserts a known "fake" value, or a zero-length string, into some events. (Comically, in my real data the fake value is the string literal "null".) A sample data set could look like:

_time                 bad        id
2021-06-09 18:45:09   fake       id2
2021-06-09 18:45:09   fake       id2
2021-06-09 18:45:09   fake       id1
2021-06-09 18:45:09   fake       id2
2021-06-09 18:45:09   fake       id1
2021-06-09 18:45:09   real id2   id2
2021-06-09 18:45:09   real id1   id1
2021-06-09 18:45:09   fake       id2
2021-06-09 18:45:09   real id1   id1
2021-06-09 18:45:09   fake       id2
2021-06-09 18:45:09   real id2   id2
2021-06-09 18:45:09              id2
2021-06-09 18:45:09   fake       id1
2021-06-09 18:45:09   fake       id1
2021-06-09 18:45:09              id2
2021-06-09 18:45:09   fake       id2
2021-06-09 18:45:09   fake       id1
2021-06-09 18:45:09   real id1   id1
2021-06-09 18:45:09   real id2   id2
2021-06-09 18:45:09   fake       id2

For my purposes, all I care about is that any field whose value is not "fake" or "" gets extracted.  The challenge, of course, is that some records may have bad values for "legitimate" reasons (e.g., data entry or machine errors), so all of their values are "fake".  I also cannot predict whether all fields have "fake" in the same record, i.e., "real" values for different fields may be contained in different events.

One method I came up with is to mvfilter() away the bad values, like so:

| dedup id bad
| eventstats values(bad) as bad by id
| eval realbad = mvfilter(NOT match(bad, "fake"))

(dedup is added just to reduce the load on eventstats; not sure if that makes a difference.) Lucky for me, values() seems to have dropped those zero-length values. (For my purposes, it is OK if this field becomes null in case all values are zero-length or fake.) Otherwise, I'll just add another test in mvfilter().
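
For example, that extra test might look something like this (just a sketch; the len() test keeps only non-empty values):

| eval realbad = mvfilter(NOT match(bad, "fake") AND len(bad) > 0)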

While this produces the desired results, I feel that eventstats + mvfilter() could still be expensive.  Are there any standard/recommended methods?

(Clarification: While the sample simulation illustrates one 'bad' field, there are multiple bad fields that flip-flop seemingly independently of each other, so globally filtering out all events containing 'invalid' values is not an option; the use case requires a valid value for as many fields as possible.)
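
For illustration, the simulation could be extended to two independently flip-flopping fields (bad1 and bad2 are made-up names, not my real field names):

| makeresults count=20
| eval id = "id" . (random() % 2 + 1)
| eval bad1 = case(random() % 3 == 0, "real1 " . id, random() % 5 == 0, "", true(), "fake")
| eval bad2 = case(random() % 3 == 0, "real2 " . id, random() % 5 == 0, "", true(), "fake")

An event can then hold a real bad1 but a fake bad2, or vice versa.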

1 Solution

yuanliu
SplunkTrust

Lucky for me, values() seems to have dropped those zero-length values. (For my purposes, it is OK if this field becomes null, in case all values are zero-length or fake.)

Two problems:

  1. values() drops empty string values only when non-empty values exist.  When all values are empty, the result is still an empty string. 
  2. If all values are "fake", the result is still "fake".

Both the empty string and "fake" are more difficult for downstream processing; a mixture of two values representing bad states is even worse for my purposes.
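
Based on the behavior described above, both failure modes can be reproduced with a small simulation: id1 (all empty) comes back as an empty string, and id2 (all fake) comes back as "fake".

| makeresults count=4
| streamstats count as n
| eval id = if(n <= 2, "id1", "id2")
| eval bad = if(id == "id1", "", "fake")
| stats values(bad) as bad by id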

Additionally, eventstats is unnecessary for this illustration and is not used in my real use case.  I also realized that match() is totally unnecessary when a string identity test will do better.  It is almost as if brute force is better.

| dedup bad id
| eval bad = if(bad IN ("", "fake"), null(), bad)
| stats values(*) as * by id

As mentioned in another comment, I have multiple "bad" fields.

With this workaround, each of them still needs its own eval, but at least values() can use a wildcard.
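
If the bad fields shared a common prefix, foreach might fold those per-field evals into one (a sketch only; bad1, bad2, and the bad* prefix are hypothetical):

| dedup id bad1 bad2
| foreach bad* [ eval <<FIELD>> = if('<<FIELD>>' IN ("", "fake"), null(), '<<FIELD>>') ]
| stats values(*) as * by id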



ITWhisperer
SplunkTrust

I am not sure I understand what you are trying to do - can you not just 

| where bad!="fake" AND bad!=""

yuanliu
SplunkTrust

That will not work because we don't know which events contain valid values for which field. (I probably oversimplified in the sample simulation; there are quite a number of bad fields that flip-flop in unpredictable ways.)
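
To make this concrete, here is a two-event sketch (bad1/bad2 are again hypothetical) where a global filter discards the only event carrying the real bad2:

| makeresults count=2
| streamstats count as n
| eval id = "id1"
| eval bad1 = if(n == 1, "real id1", "fake")
| eval bad2 = if(n == 1, "fake", "real id1")
| where bad1 != "fake" AND bad1 != ""

Only the first event survives, so the real bad2 value in the second event is lost.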
