topic Re: How to obtain valid value of a field among known bad ones in Splunk Search

How to obtain valid value of a field among known bad ones

yuanliu — Thu, 10 Jun 2021 17:50:02 GMT

I have some data with flip-flop values akin to the following simulation

| makeresults count=20 | eval id = "id" . (random() % 2 + 1), bad = case(random() % 3 == 0, "real " . id, random() % 5 == 0, "", true(), "fake")

What it does is to insert a known "fake" value, and zero-length string in some events. (Comically, the fake value is the string literal "null".) A sample data set could look like

_time	bad	id
2021-06-09 18:45:09	fake	id2
2021-06-09 18:45:09	fake	id2
2021-06-09 18:45:09	fake	id1
2021-06-09 18:45:09	fake	id2
2021-06-09 18:45:09	fake	id1
2021-06-09 18:45:09	real id2	id2
2021-06-09 18:45:09	real id1	id1
2021-06-09 18:45:09	fake	id2
2021-06-09 18:45:09	real id1	id1
2021-06-09 18:45:09	fake	id2
2021-06-09 18:45:09	real id2	id2
2021-06-09 18:45:09		id2
2021-06-09 18:45:09	fake	id1
2021-06-09 18:45:09	fake	id1
2021-06-09 18:45:09		id2
2021-06-09 18:45:09	fake	id2
2021-06-09 18:45:09	fake	id1
2021-06-09 18:45:09	real id1	id1
2021-06-09 18:45:09	real id2	id2
2021-06-09 18:45:09	fake	id2

For my purposes, all I care is that any field whose value is not "fake" or "" be extracted. The challenge is, of course, that some records may have bad value for "legitimate" reasons (e.g., data entry or machine errors) so all their values are "fake". I also cannot predict whether all fields are having "fake" in the same record, i.e., "real" value for different fields may be contained in different events.

One method I come up with is to mvfilter() from bad, like so

| dedup id bad | eventstats values(bad) as bad by id | eval realbad = mvfilter(NOT match(bad, "fake"))

(dedup is added just to reduce load for eventstats. Not sure if that makes a difference.) Lucky for me, values() seems to have dropped those zero-length values. (For my purposes, it is OK if this field becomes null, in case all values are zero-length or fake.) Otherwise, I'll just add another test in mvfilter().

Whereas this produces desired results, I feel that eventstats+mvfilter() could still be expensive. Any standard/recommended methods?

(Clarification: Whereas sample simulation illustrates one 'bad' field, there are multiple bad fields that flip-flop seemingly independent of each other, so globally filtering out all events containing 'invalid' values is not an option because the use case requires valid value of as many fields as possible.)

Re: How to obtain valid value of a field among known bad ones

ITWhisperer — Wed, 09 Jun 2021 22:25:21 GMT

I am not sure I understand what you are trying to do - can you not just

| where bad!="fake" AND bad!=""

Re: How to obtain valid value of a field among known bad ones

yuanliu — Thu, 10 Jun 2021 05:50:18 GMT

The reason why that will not work is because we don't know which events contain valid values for which field. (I probably oversimplified in sample simulation. There are quite a number of bad fields kind of flip-flop in unpredictable ways.)

Re: How to obtain valid value of a field among known bad ones

yuanliu — Wed, 16 Jun 2021 02:07:30 GMT

Lucky for me, values() seems to have dropped those zero-length values. (For my purposes, it is OK if this field becomes null, in case all values are zero-length or fake.)

Two problems:

values() drops empty string values only when non-empty values exist. When all values are empty, the result is still an empty string.
If all values are "fake", the result is still "fake".

Both empty string and "fake" are more difficult for downstream processing; a mixture of two values to represent bad states is worse for my purposes.

Additionally, eventstats is useless for this illustration and not used in my use case. I also realize that match() is totally unnecessary when string identity will do better. It is almost as if brute force is better.

| dedup bad id | eval bad = if(bad IN ("", "fake"), null(), bad) | stats values(*) as * by id

As mentioned in another comment, I have multiple "bad" fields.

With this workaround, each of them still needs to eval separately, but at least values() can use wildcard.