Monitoring Splunk

How does parsing increase the size of raw logs?

Na_Kang_Lim
Explorer

Is the size of logs after they are stored in buckets, compared to their raw size, a metric I should monitor?

This question came to mind, and the problem is that I don't really know how to measure it: from the deployment/admin view, I can only see the size of a bucket. But a bucket can store logs from multiple hosts, and I don't know the size of the raw logs sent by each host for a single bucket.
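The closest I can get to the raw side is per-host ingest volume from the license usage log, along the lines of the rough sketch below (assuming I can search the _internal index), but that shows volume per host and index, not the share of any particular bucket, and I understand the values in that log can get squashed when there are too many distinct hosts or sources:

index=_internal source=*license_usage.log* type=Usage
| stats sum(b) AS raw_bytes BY h, idx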

So is there any formula to calculate this?

AFAIK, the TRANSFORMS- setting in props.conf is one of the main factors that increases the size of logs after parsing, since it creates index-time field extractions.
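To be concrete, by index-time extraction I mean a setup roughly like this sketch (the sourcetype, transform, and field names are placeholders I made up):

props.conf:

# hypothetical sourcetype; TRANSFORMS- runs the transform at index time
[my_sourcetype]
TRANSFORMS-addfield = add_mytestfield

transforms.conf:

# pull a value out of the raw event and write it to _meta as an indexed field
[add_mytestfield]
REGEX = user=(\w+)
FORMAT = mytestfield::$1
WRITE_META = true

fields.conf:

# mark the field as indexed so searches use the lexicon instead of a search-time extraction
[mytestfield]
INDEXED = true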

Also, if there is no exact formula, has anyone measured how much logs grow after parsing when using well-known apps, for example for WinEventLog or Linux data?


PickleRick
SplunkTrust

It's complicated 😁

Firstly, Splunk stores the contents of the raw data. Compressed. The compression ratio is more or less known for typical text data, so we can estimate the raw data usage.

But that's definitely not all that Splunk stores about its indexed data.

First, it splits the data on major and minor breakers and stores the resulting tokens along with "pointers" to the events they came from (there are some finer details about minor breakers, but we won't dig into them here). So if you have an event

Jan 23 2025 localhost test[23418] sample event

Splunk will store the whole raw event in its raw event journal and will generate a "pointer" to that event within that journal (for the sake of this example we will assume the value of this "pointer" is 0xDEADBEEF; it doesn't matter what it actually looks like internally).

Additionally, Splunk will split the data and add separate entries to its index, each mapping a token to a pointer to the original raw event. So the index will contain:

Token      Pointers
2025       0xDEADBEEF
23         0xDEADBEEF
23418      0xDEADBEEF
event      0xDEADBEEF
jan        0xDEADBEEF
localhost  0xDEADBEEF
sample     0xDEADBEEF
test       0xDEADBEEF

Now if Splunk ingests another event

Jan 24 2025 localhost test[23418] another sample event

It will save it to the raw data journal, assign it another pointer - let's say it's 0x800A18A0.

And it will update its index so that it now contains:

Token      Pointers
2025       0xDEADBEEF, 0x800A18A0
23         0xDEADBEEF
24         0x800A18A0
23418      0xDEADBEEF, 0x800A18A0
another    0x800A18A0
event      0xDEADBEEF, 0x800A18A0
jan        0xDEADBEEF, 0x800A18A0
localhost  0xDEADBEEF, 0x800A18A0
sample     0xDEADBEEF, 0x800A18A0
test       0xDEADBEEF, 0x800A18A0

So you can see that the actual index contents are highly dependent on the entropy of the data.

If you have just one value or a small set of values which simply repeat throughout your whole data stream, the index will contain just a small set of unique values with a lot of "pointers" to the raw events.

But if your events contain unique tokens, the index will grow in terms of indexed values and each of them will be pointing to just one raw event.

So that's already complicated 😉
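By the way, if you want to see what this lexicon actually looks like for your own data, reasonably recent Splunk versions have the walklex command, which dumps the terms stored in a bucket's index. A minimal sketch ("main" is just an example index; on a busy index the output can be huge, hence the head):

| walklex index=main type=term
| head 50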

Additionally, if you create indexed fields, they are actually stored in the same index as the tokens parsed out of the raw event, just with the field name as a prefix. So if you create an indexed field called "mytestfield" with a value of "value1", it will be saved alongside the tokens as "mytestfield::value1". As an interesting bit of trivia - indexed fields are indistinguishable from key::value tokens parsed out of the raw data.
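You can spot those field::value entries sitting right next to the ordinary tokens with the same walklex command, for example something along these lines (again assuming a version that has walklex; the exact output field names may differ slightly between versions):

| walklex index=main type=fieldvalue
| search term="mytestfield::*"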

So indeed, if you're creating indexed fields, you cause the index to grow. There is no simple linear dependency though since the growth depends on the cardinality of the field, the size of the field values (the size of the name itself too), and the number of events to which the index entry has to point.
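If you're wondering how much a candidate indexed field would add, a quick search-time check of its cardinality at least tells you how many distinct lexicon entries it would create (index, sourcetype and field names here are placeholders):

index=main sourcetype=my_sourcetype
| stats dc(user) AS distinct_values count AS events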

Additionally, Splunk stores some simple summarizing csv files (which are relatively negligible in size) as well as a bloom filter, which is a kind of simplified index containing just the tokens, without the associated pointers. It might seem a bit redundant, but it's actually pretty useful - Splunk can determine whether a term is worth looking for in a bucket at all without processing the full index, which might be way bigger. So Splunk can simply skip searching a particular bucket if it knows it won't find anything there.

So, long story short - it's relatively complicated and there is no simple formula to give you a sure estimate of how your data will grow if you - for example - add a single indexed field. The rule of thumb is that the "core" indexed data (the raw events along with essential metadata fields) comes to about 15% of the original size of the raw data, and the indexes add another 35% of the original size. But that's only a generalized estimate. There's no way to reliably calculate it beforehand since many factors come into play.
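If you'd rather measure your existing data than rely on the rule of thumb, you can compare the uncompressed raw size that dbinspect reports for each bucket with the bucket's size on disk. A rough sketch (as far as I remember, rawSize is in bytes and sizeOnDiskMB in megabytes; hot buckets will skew the numbers a bit):

| dbinspect index=main
| stats sum(rawSize) AS raw_bytes sum(sizeOnDiskMB) AS disk_mb
| eval raw_mb=round(raw_bytes/1024/1024,2)
| eval disk_to_raw_ratio=round(disk_mb/raw_mb,2)

It won't break the growth down into journal vs. index files, but it gives you the overall on-disk to raw ratio for an index.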
