Exactly which bytes count as license usage?

Graham_Hanningt
Builder

I've read various topics on license usage, but I'm still confused about the basic calculation: exactly which bytes count as license usage?

A possible answer might be: the number of bytes in the _raw field. But I recognize that might be simplistic, or at least incomplete.

My own - possibly faulty - experiments indicate that "number of bytes" is simplistic, at least in the following regard: len() appears to count multibyte UTF-8 characters as 1, as I'd hope. So, "number of characters", then, depending on the character set encoding used by Splunk to interpret the length of a string.
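
To make the character-versus-byte distinction concrete, here is a minimal sketch in Python rather than SPL. It only illustrates how a character count and a UTF-8 byte count diverge for non-ASCII text; it does not show how Splunk's len() or license metering behave. The helper name char_and_byte_counts is made up for this example.

```python
# Illustration only (Python, not Splunk): character count vs. UTF-8 byte count.
# char_and_byte_counts is a hypothetical helper name, not a Splunk function.
def char_and_byte_counts(text):
    """Return (number of characters, number of bytes when encoded as UTF-8)."""
    return len(text), len(text.encode("utf-8"))

# Pure ASCII: one byte per character, so both counts agree.
ascii_counts = char_and_byte_counts("myvalue")

# "\u00ef" (i with diaeresis) takes 2 bytes and "\u20ac" (euro sign) takes 3 bytes in UTF-8.
multibyte_counts = char_and_byte_counts("na\u00efve \u20ac5")

print(ascii_counts)      # (7, 7)
print(multibyte_counts)  # (8, 11)
```

If len() in Splunk counts characters the same way, then len(_raw) would under-report bytes for any event containing multibyte characters.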

The recent Splunk blog post "What size should my Splunk license be?" includes the following command in a search:

eval evt_bytes = len(_raw)

The naming of that field - specifically, the trailing term _bytes - makes me think that I might be wrong about how len() treats multibyte characters.

However, I'm unsure, and (with apologies to the blog post author if I've missed it) the blog post doesn't say whether the b field from index=_internal source=*license_usage.log type=Usage is simply a total of evt_bytes, includes other bytes, or is not based on len(_raw) at all.

For example, if I send Splunk the following JSON-formatted event via TCP:

{"time":"2016-05-20 09:00:00.000","myfield":"myvalue"}\r\n

(where \r\n represents two bytes: a "carriage return/linefeed pair")

consisting of 56 bytes (if you include the trailing \r\n)

then what exactly is this event's contribution to license usage? 56 bytes? Or 54 bytes (if the \r\n is not included)? Or a higher number, to account for Splunk internal field values associated with this event?
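
The two candidate figures can be checked with a short Python sketch. It only counts the bytes sent over the wire, with and without the trailing \r\n. Which of these, if either, Splunk records as license usage is exactly the open question.

```python
# Illustration only: count the bytes of the example TCP event as transmitted.
event = '{"time":"2016-05-20 09:00:00.000","myfield":"myvalue"}'

# Bytes of the JSON text alone (all ASCII, so one byte per character).
bytes_without_crlf = len(event.encode("utf-8"))

# Bytes including the trailing carriage return/linefeed pair.
bytes_with_crlf = len((event + "\r\n").encode("utf-8"))

print(bytes_without_crlf)  # 54
print(bytes_with_crlf)     # 56
```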

While I'm asking (with apologies if readers think this should be a separate question)... if I send the same event via the HTTP Event Collector:

{"time":1463734800,"event":{"myfield":"myvalue"}}

then do I save on license usage by having the time stamp as metadata, rather than in the event data (that becomes the _raw field)?

Before asking this question, I considered performing my own tests, indexing single events (via TCP and HEC) into brand new indexes, and then looking at the corresponding b field values in the log file. I might still do that, but I have limited time, and anyway, I'd like to know what the figures should show, so that, if I do these tests, I can confirm or deny that (or, more likely, figure out where I've gone wrong in my testing 🙂 ).


jkat54
SplunkTrust

OK, len(_raw) works because each ASCII character is stored as 8 bits = 1 byte on disk. So the word "four" is 4 bytes, the word "OMG" is 3 bytes, the number 456 in string format is 3 bytes, the string "hey 1234" is 8 bytes, and so on.

That's why the length of the raw field equates to bytes. But that only holds if you're working with ASCII encoding:
http://stackoverflow.com/questions/1049139/do-certain-characters-take-more-bytes-than-others

{"time":"2016-05-20 09:00:00.000","myfield":"myvalue"}\r\n <- would drop the \r\n, time is stored in epoch in the index (IF EXTRAPOLATED correctly, but your event would remain this total length in size according to license usage)

{"time":1463734800,"event":{"myfield":"myvalue"}} <- time is stored in epoch in the index (IF EXTRAPOLATED correctly), and you would save on your license because the timestamp is smaller

Further savings would come from this:
{"time":1463734800,"event1":{"myfield":"myvalue"},"event2":{"myfield":"myvalue"},"event3":{"myfield":"myvalue"}}

As there would be only one time stamp for 3 events. JSON is not much fun to play with, though... see this post https://answers.splunk.com/answering/401972/view.html where I recently learned the horrors of "nested json"


Graham_Hanningt
Builder

@jkat54, thanks for your answer.

Re:

the timestamp is smaller

I don't understand. Smaller? How?

In my TCP example, the event data contains a time field that Splunk uses to set the internal _time field according to TIME_PREFIX and TIME_FORMAT settings in props.conf. And that time field value also appears with the rest of the event data in the _raw field.

In my HEC example, the event key does not contain a time field. Instead, the event time stamp is specified in the metadata time key that Splunk uses to set the _time field.

In both examples, the indexed event has an internal _time field (Unix Epoch time value).

However, only the _raw field for the event received via TCP contains a time field. The _raw field for the event received via HEC does not contain a time stamp value. That is where I see the potential saving in license usage: the absence of a time stamp value from the _raw field. Is that what you meant by "smaller"?
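
The potential saving can be put into numbers with a rough Python sketch. It rests on two assumptions that nothing in this thread confirms: that _raw for the HEC event is the compact JSON serialization of the "event" value, and that license usage tracks the byte length of _raw. The helper hec_raw_guess is hypothetical.

```python
import json

# Illustration only. Assumptions (unconfirmed): _raw for a HEC event is the
# compact JSON text of the "event" value, and license usage follows the byte
# length of _raw.
tcp_event = '{"time":"2016-05-20 09:00:00.000","myfield":"myvalue"}'
hec_payload = '{"time":1463734800,"event":{"myfield":"myvalue"}}'

def hec_raw_guess(payload):
    """Hypothetical helper: re-serialize the "event" value without whitespace."""
    event = json.loads(payload)["event"]
    return json.dumps(event, separators=(",", ":"))

tcp_raw_bytes = len(tcp_event.encode("utf-8"))
hec_raw_bytes = len(hec_raw_guess(hec_payload).encode("utf-8"))

# The difference is the length of the "time" key, its quoted value, and the comma.
print(tcp_raw_bytes, hec_raw_bytes, tcp_raw_bytes - hec_raw_bytes)  # 54 21 33
```

Under those assumptions, moving the time stamp into metadata would trim 33 bytes from this event's _raw field.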

(And when you wrote "extrapolated", did you mean "extracted"?)
