I've read various topics on license usage, but I'm still confused about the basic calculation: exactly which bytes count as license usage?
A possible answer might be: the number of bytes in the _raw field. But I recognize that might be simplistic, or at least incomplete.
My own - possibly faulty - experiments indicate that "number of bytes" is simplistic, at least in the following regard: len() appears to count multibyte UTF-8 characters as 1, as I'd hope. So, "number of characters", then, depending on the character set encoding used by Splunk to interpret the length of a string.
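To make the character-versus-byte distinction concrete outside Splunk, here is a minimal Python sketch. It only illustrates how a character count and a UTF-8 byte count diverge for non-ASCII text; it makes no claim about how Splunk's len() or license metering actually count. The function name char_and_byte_lengths is hypothetical.

```python
def char_and_byte_lengths(text):
    """Return (number of characters, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


# ASCII-only text: one byte per character.
# "caf\u00e9" ends in a precomposed e-acute (2 bytes in UTF-8).
# "\u20ac" is the euro sign (3 bytes in UTF-8).
for sample in ["myvalue", "caf\u00e9", "\u20ac"]:
    chars, utf8_bytes = char_and_byte_lengths(sample)
    print(f"{sample!r}: {chars} characters, {utf8_bytes} UTF-8 bytes")
```

For pure ASCII data the two numbers are identical, which is why the difference is easy to miss in testing.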
The recent Splunk blog post "What size should my Splunk license be?" includes the following command in a search:
eval evt_bytes = len(_raw)
The naming of that field - specifically, the trailing term _bytes - makes me think that I might be wrong about how len() treats multibyte characters.
However, I'm unsure, and - with apologies to the blog post author if I've missed it - the blog post doesn't say whether the b field from index=_internal source=*license_usage.log type=Usage is simply a total of evt_bytes, or includes other bytes, or is not based on len(_raw) at all.
For example, if I send Splunk the following JSON-formatted event via TCP:
{"time":"2016-05-20 09:00:00.000","myfield":"myvalue"}\r\n
(where \r\n represents two bytes: a "carriage return/linefeed pair"), consisting of 56 bytes (if you include the trailing \r\n), then what exactly is this event's contribution to license usage? 56 bytes? Or 54 bytes (if the \r\n is not included)? Or a higher number, to account for Splunk internal field values associated with this event?
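The 54 and 56 figures quoted above can be checked with a short Python sketch. It only counts the bytes of the text as sent over the wire; whether Splunk meters the trailing line ending is exactly the open question here, so the sketch does not answer it.

```python
# The event text exactly as shown in the question, without the line ending.
event = '{"time":"2016-05-20 09:00:00.000","myfield":"myvalue"}'
line_ending = "\r\n"  # carriage return + linefeed, two bytes

without_crlf = len(event.encode("utf-8"))
with_crlf = len((event + line_ending).encode("utf-8"))

print(f"without CRLF: {without_crlf} bytes")  # 54
print(f"with CRLF:    {with_crlf} bytes")     # 56
```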
While I'm asking (with apologies if readers think this should be a separate question)... if I send the same event via the HTTP Event Collector:
{"time":1463734800,"event":{"myfield":"myvalue"}}
then do I save on license usage by having the time stamp as metadata, rather than in the event data (that becomes the _raw field)?
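As a rough way to size the potential saving, the Python sketch below compares the byte length of the TCP event with the byte length of the HEC "event" value alone. It rests on an assumption that is not confirmed anywhere in this thread: that for HEC only the value of the "event" key becomes _raw, in the compact form shown. The variable names are hypothetical.

```python
import json

tcp_event = '{"time":"2016-05-20 09:00:00.000","myfield":"myvalue"}'
hec_payload = '{"time":1463734800,"event":{"myfield":"myvalue"}}'

# Assumption: only the "event" value ends up in _raw for HEC input.
# Re-serialize it compactly (no spaces) to match the payload as written.
hec_event = json.dumps(json.loads(hec_payload)["event"], separators=(",", ":"))

tcp_bytes = len(tcp_event.encode("utf-8"))
hec_event_bytes = len(hec_event.encode("utf-8"))
hec_payload_bytes = len(hec_payload.encode("utf-8"))

print(f"TCP event:          {tcp_bytes} bytes")          # 54
print(f"HEC 'event' value:  {hec_event_bytes} bytes")    # 21
print(f"HEC full payload:   {hec_payload_bytes} bytes")  # 49
```

If the assumption holds, the difference between the first two figures is the time stamp key and value that no longer appear in the event data.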
Before asking this question, I considered performing my own tests, indexing single events (via TCP and HEC) into brand new indexes, and then looking at the corresponding b
field values in the log file. I might still do that, but I have limited time, and anyway, I'd like to know what the figures should show, so that, if I do these tests, I can confirm or deny that (or, more likely, figure out where I've gone wrong in my testing 🙂 ).
OK, len(_raw) works because each ASCII character = 8 binary bits = 1 byte on disk, so the word four is 4 bytes, the word OMG is 3 bytes, the number 456 in string format is 3 bytes, the string "hey 1234" is 8 bytes, and so on.
That's why getting the length of the raw field equates to bytes. But that only works if you're working with ASCII encoding:
http://stackoverflow.com/questions/1049139/do-certain-characters-take-more-bytes-than-others
{"time":"2016-05-20 09:00:00.000","myfield":"myvalue"}\r\n <- would drop the \r\n, time is stored in epoch in the index (IF EXTRAPOLATED correctly, but your event would remain this total length in size according to license usage)
{"time":1463734800,"event":{"myfield":"myvalue"}} <- time is stored in epoch in the index (IF EXTRAPOLATED correctly), and you would save on your license because the timestamp is smaller
Further savings would come from this:
{"time":1463734800,"event1":{"myfield":"myvalue"},"event2":{"myfield":"myvalue"},"event3":{"myfield":"myvalue"}}
As there would be only one time stamp and 3 events. JSON is not much fun to play with, though... see this post https://answers.splunk.com/answering/401972/view.html where I recently learned the horrors of "nested json".
@jkat54, thanks for your answer.
Re: "the timestamp is smaller"
I don't understand. Smaller? How?
In my TCP example, the event data contains a time field that Splunk uses to set the internal _time field according to TIME_PREFIX and TIME_FORMAT settings in props.conf. And that time field value also appears with the rest of the event data in the _raw field.
In my HEC example, the event key does not contain a time field. Instead, the event time stamp is specified in the metadata time key that Splunk uses to set the _time field.
In both examples, the indexed event has an internal _time field (Unix Epoch time value).
However, only the _raw field for the event received via TCP contains a time field. The _raw field for the event received via HEC does not contain a time stamp value. That is where I see the potential saving in license usage: the absence of a time stamp value from the _raw field. Is that what you meant by "smaller"?
(And when you wrote "extrapolated", did you mean "extracted"?)
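For reference, the two time stamp representations discussed in this thread can be compared with a small Python sketch. It assumes the TCP example's time stamp is in UTC (the thread does not state a time zone), and the helper name to_epoch is hypothetical. It shows only that the two values denote the same instant and how many bytes each takes as text; it says nothing about how Splunk stores _time internally.

```python
from datetime import datetime, timezone


def to_epoch(text, fmt="%Y-%m-%d %H:%M:%S.%f"):
    """Parse a time stamp string as UTC (an assumption) and return epoch seconds."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return parsed.timestamp()


text_timestamp = "2016-05-20 09:00:00.000"
epoch_timestamp = 1463734800

# Both forms refer to the same instant if the text form is UTC.
print(to_epoch(text_timestamp) == epoch_timestamp)

# Size of each form as it would appear in event text.
print(len(f'"{text_timestamp}"'))  # quoted string form
print(len(str(epoch_timestamp)))   # bare epoch number
```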