Re: Best practice for representing bit flag fields...

Graham_Hanningt · ‎03-18-2016

Suppose I have a field that consists of a byte value, where each bit can represent a "flag": a property whose value is either true or false. In the definition of the record layout, the "parent" field (the byte) has a name, and so does each of the "child" bit flags.

For example, suppose I have a field named toppings that occupies one byte, where each bit represents whether or not a particular topping was added to a pizza:

anchovies
bacon
chilli
mushrooms
olives
pepperoni

(These names are fictional, but the structure matches actual fields in my data.)

Two of the bits are currently unused.
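
For illustration, here is a minimal Python sketch of decoding such a byte into named flags before sending it to Splunk. The bit positions are an assumption (anchovies as the most significant bit, the two unused bits lowest), and `decode_toppings` is a hypothetical name; the real record layout defines the actual positions.

```python
# Assumed layout: most significant bit first, two lowest bits unused.
TOPPING_BITS = {
    "anchovies": 0x80,
    "bacon": 0x40,
    "chilli": 0x20,
    "mushrooms": 0x10,
    "olives": 0x08,
    "pepperoni": 0x04,
}

def decode_toppings(value: int) -> dict:
    """Return a dense mapping of every known flag to True or False."""
    if not 0 <= value <= 0xFF:
        raise ValueError("expected a single byte (0-255)")
    return {name: bool(value & mask) for name, mask in TOPPING_BITS.items()}

# 0b01010000 has only the (assumed) bacon and mushrooms bits set.
print(decode_toppings(0b01010000))
```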

Now suppose I have the freedom to format that data in any way I choose before I get it into Splunk.

Some considerations:

Should I bother including the original byte value, as a number?
I'm tending towards "no". But suppose (I know, there's a lot of supposing going on here) we have zillions of these records, and for the foreseeable future we're only interested in whether the toppings included bacon or mushrooms, with a slim chance that we might at some point also be interested in the others. Then maybe we only break out bacon and mushrooms as separate properties for now. This runs the risk of forgetting what the other bits mean, or of missing that their meaning has changed over time, but it's cheaper to index fewer fields, unless you have an "all you can ingest" license.
Should the data be "sparse" or "dense"?
Let's say we've decided that we're interested in all of the toppings, and that the absence of a flag means "false". One problem: record formats can change over time; new flags can appear in data, and existing flags can become obsolete. If we introduce new toppings (say, onion and capers) and we've assumed that the absence of a flag means "false", then, when we analyze our data, if we don't keep in mind when onions and capers became available, we might mistakenly think that pizza eaters before a certain date eschewed those toppings. We've lost the distinction between "false" and absent (or null). (More realistically, for my use case: we might mistakenly think that a particular software property was "false", when in fact that property did not even exist in the version of the software that created the log record.)
If I use a data format such as JSON that supports nested structures, should I nest the bit flags under their parent, or should I keep a flat structure?
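
The false-versus-absent problem in the sparse-versus-dense consideration can be made concrete with a small Python sketch. The `KNOWN_FLAGS_BY_VERSION` table and its version numbers are hypothetical; the idea is only that a dense encoder emits exactly the flags that existed in the format version that produced the record, so a missing key keeps meaning "did not exist" rather than "false".

```python
# Hypothetical: which flags exist in each version of the record format.
KNOWN_FLAGS_BY_VERSION = {
    1: ["anchovies", "bacon", "chilli", "mushrooms", "olives", "pepperoni"],
    2: ["anchovies", "bacon", "chilli", "mushrooms", "olives", "pepperoni",
        "onion", "capers"],
}

def dense_flags(version, flags_set):
    """Emit every flag known to this version, explicitly True or False."""
    return {name: name in flags_set for name in KNOWN_FLAGS_BY_VERSION[version]}

old = dense_flags(1, {"bacon", "mushrooms"})
new = dense_flags(2, {"bacon", "mushrooms"})

print(old.get("onion"))  # None: the flag did not exist in version 1
print(new.get("onion"))  # False: the flag existed and was not set
```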

Some examples:

Example 1: dense JSON, nested

All available toppings represented.

"toppings": {
  "anchovies": true,
  "bacon": true,
  "chilli": true,
  "mushrooms": true,
  "olives": false,
  "pepperoni": false
}

Example 2: dense JSON, flat

"toppings_anchovies": true,
"toppings_bacon": true,
"toppings_chilli": true,
"toppings_mushrooms": true,
"toppings_olives": false,
"toppings_pepperoni": false

Example 3: sparse JSON, nested

No overall byte value; only "true" properties present (others assumed "false"; literally, missing):

"toppings": {
   "bacon": true,
   "mushrooms": true
}

Example 4: sparse JSON, flat

There might have been other toppings.

"toppings_bacon": true,
"toppings_mushrooms": true

Example 5: sparse JSON, flat, with original byte value

There were other toppings than bacon and mushrooms, but you'd have to know how to interpret the byte value 240.

"toppings": 240,
"toppings_bacon": true,
"toppings_mushrooms": true

Summary

I think the "sparse" options (especially, where missing means false) are asking for trouble, but I thought I'd at least mention these options, because indexing data costs money.

So I think it's down to the "dense" options. In which case, I don't see the point in indexing the original numeric byte value.

But nested or flat? Nested means less data ingested (less repetition of the toppings qualifier), and I don't see any problems referring to nested properties such as toppings.anchovies.

But if I choose nested, then I think that rules out offering users the freedom to ingest from either CSV or JSON and then use the same search strings in Splunk regardless of the input data format, because the data ingested from CSV won't have the nested structure toppings.anchovies.
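
If the nested JSON form were kept as the primary format, a flat view for CSV could be derived mechanically. The following Python sketch is illustrative only: `flatten` is a hypothetical helper, and the `_` separator is chosen to reproduce the `toppings_anchovies` style of naming from Example 2.

```python
import csv
import io

def flatten(record, sep="_", prefix=""):
    """Collapse nested dicts into a single level, joining keys with sep."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, sep, name))
        else:
            flat[name] = value
    return flat

nested = {"toppings": {"bacon": True, "mushrooms": True, "olives": False}}
flat = flatten(nested)

# Write the flat record as CSV with a header row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat))
writer.writeheader()
writer.writerow(flat)
print(buffer.getvalue())
```

Calling `flatten(nested, sep=".")` instead would produce dotted names such as `toppings.bacon`. Whether CSV column names in that form would behave identically to nested JSON fields in searches is something to verify rather than assume.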

Thoughts and advice welcome.

martin_mueller · ‎04-21-2016

Have you considered {"toppings": ["pepperoni", "chilli"]}?
I'm leaning this way because pepperoni is a value, yet your examples all use it as a key. If your list of ingredients changes, your list of fields changes... or, in RDBMS terms, your schema changes. While it's of course possible to work with this, I consider it easier to have a variable number of rows compared to a variable number of columns (sticking with the RDBMS analogy).

Including the byte value won't hurt, so do it. Makes for an easy "distinct count of topping variations used", "most popular pizza by time of day", etc. calculation.
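
A rough Python sketch of that shape, for illustration: it emits the toppings as a list of values and keeps the byte alongside it. The bit masks are assumed (most significant bit first), and the `toppings_byte` key and `to_event` function are hypothetical names.

```python
import json

# Assumed bit positions; the real layout comes from the record definition.
TOPPING_MASKS = [
    ("anchovies", 0x80), ("bacon", 0x40), ("chilli", 0x20),
    ("mushrooms", 0x10), ("olives", 0x08), ("pepperoni", 0x04),
]

def to_event(value):
    """Build an event with the toppings as a list plus the raw byte."""
    names = [name for name, mask in TOPPING_MASKS if value & mask]
    return {"toppings": names, "toppings_byte": value}

# 0x24 sets the (assumed) chilli and pepperoni bits.
print(json.dumps(to_event(0x24)))
```

With this layout a new topping only adds a new value to the list; the set of field names stays the same.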

woodcock · ‎03-20-2016

This really cannot be answered without knowing your priorities. Nested vs. flat is irrelevant because both will be parsed out the same and take roughly the same amount of time to process, so go whichever way you (or your developers) prefer. If your top priority is CapEx, then you need to conserve license and disk space, so I would just pass in the toppings bit-flag field (#5 or #6), but this will have the worst search performance. If your top priority is readability, then go with #1 or #2. If you need a compromise, go with #3 or #4.

If you go with #5/#6, you can still build out lookups to handle creating the fields at search time.
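
As one possible way to prepare such a lookup, the Python sketch below generates a CSV table with one row per possible byte value and one column per flag. The bit masks and the column names are assumptions for illustration, not something taken from the actual data.

```python
import csv
import io

# Assumed bit positions; adjust to the real record layout.
MASKS = {"anchovies": 0x80, "bacon": 0x40, "chilli": 0x20,
         "mushrooms": 0x10, "olives": 0x08, "pepperoni": 0x04}

def lookup_rows():
    """Yield one row per possible byte value with each flag spelled out."""
    for value in range(256):
        row = {"toppings": value}
        for name, mask in MASKS.items():
            row[f"toppings_{name}"] = "true" if value & mask else "false"
        yield row

rows = list(lookup_rows())
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
lookup_csv = out.getvalue()  # write this to a file to use as a lookup table
```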

Graham_Hanningt · ‎03-18-2016

I don't know why "5." appears in some of those code listings.

richgalloway · ‎03-18-2016

The "5." is a line number.

---
If this reply helps you, Karma would be appreciated.

Graham_Hanningt · ‎03-18-2016

Yeah, I figured out that much (and should have said so). Can I suppress it in my Markdown, or is it "beyond my control"?

richgalloway · ‎03-18-2016

I'm not aware of a way to eliminate the line numbers.


Graham_Hanningt · ‎03-18-2016

In practice, the parent byte field names are unique within the context of a particular sourcetype... perhaps in conjunction with some other field. But the bit flag field names are not necessarily unique within that context, so their names either need to be qualified with a prefix (in the case of a "flat" structure), or they need to be nested in a parent property.

Best practice for representing bit flag fields in input data?
