Monitoring Splunk

How does parsing increase the size of raw logs?

Na_Kang_Lim
Explorer

Is the size of logs after they are stored in buckets, compared to their raw size, a metric I should monitor?

This question came to mind, and the problem is that I don't really know how to measure it: from the deployment/admin view, I can only see the size of a bucket. But a bucket can store logs from multiple hosts, and I don't know the size of the raw logs sent by each host for a single bucket.
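The closest I can get to the raw side is per-host ingest volume from the license usage log, along the lines of the rough sketch below (assuming I can search the _internal index), but that shows volume per host and index, not the share of any particular bucket, and I understand the values in that log can get squashed when there are too many distinct hosts or sources:

index=_internal source=*license_usage.log* type=Usage
| stats sum(b) AS raw_bytes BY h, idx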

So is there any formula to calculate this?

AFAIK, the TRANSFORMS- setting in props.conf is one of the main factors that increases the size of logs after parsing, since it creates index-time field extractions.
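To be concrete, by index-time extraction I mean a setup roughly like this sketch (the sourcetype, transform, and field names are placeholders I made up):

props.conf:

# hypothetical sourcetype; TRANSFORMS- runs the transform at index time
[my_sourcetype]
TRANSFORMS-addfield = add_mytestfield

transforms.conf:

# pull a value out of the raw event and write it to _meta as an indexed field
[add_mytestfield]
REGEX = user=(\w+)
FORMAT = mytestfield::$1
WRITE_META = true

fields.conf:

# mark the field as indexed so searches use the lexicon instead of a search-time extraction
[mytestfield]
INDEXED = true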

Also, if there is no exact formula, has anyone measured how much logs grow after parsing when using well-known apps, for example for WinEventLog or Linux data?


PickleRick
SplunkTrust

It's complicated 😁

Firstly, Splunk stores the contents of the raw data. Compressed. The compression ratio is more or less known for typical text data, so we can estimate the raw data usage.

But that's definitely not all that Splunk stores about its indexed data.

First, it splits the data on major and minor breakers and stores the resulting tokens along with "pointers" to the events they came from (there are some finer details about minor breakers, but we won't dig into them here). So if you have an event

Jan 23 2025 localhost test[23418] sample event

Splunk will store the whole raw event in its raw event journal and will generate a "pointer" to that event within that journal (for the sake of this example we will assume the value of this "pointer" is 0xDEADBEEF; it doesn't matter what it actually looks like internally).

Additionally, Splunk will split the data and add separate entries to its index, each mapping a token to a pointer to the original raw event. So the index will contain:

Token      Pointers
2025       0xDEADBEEF
23         0xDEADBEEF
23418      0xDEADBEEF
event      0xDEADBEEF
jan        0xDEADBEEF
localhost  0xDEADBEEF
sample     0xDEADBEEF
test       0xDEADBEEF

Now if Splunk ingests another event

Jan 24 2025 localhost test[23418] another sample event

It will save it to the raw data journal, assign it another pointer - let's say it's 0x800A18A0.

And it will update its index so that it now contains:

Token      Pointers
2025       0xDEADBEEF, 0x800A18A0
23         0xDEADBEEF
24         0x800A18A0
23418      0xDEADBEEF, 0x800A18A0
another    0x800A18A0
event      0xDEADBEEF, 0x800A18A0
jan        0xDEADBEEF, 0x800A18A0
localhost  0xDEADBEEF, 0x800A18A0
sample     0xDEADBEEF, 0x800A18A0
test       0xDEADBEEF, 0x800A18A0

So you can see that the actual index contents are highly dependent on the entropy of the data.

If you have just one value or a small set of values which simply repeat throughout your whole data stream, the index will contain just a small set of unique values with a lot of "pointers" to the raw events.

But if your events contain unique tokens, the index will grow in terms of indexed values and each of them will be pointing to just one raw event.

So that's already complicated 😉
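By the way, if you want to see what this lexicon actually looks like for your own data, reasonably recent Splunk versions have the walklex command, which dumps the terms stored in a bucket's index. A minimal sketch ("main" is just an example index; on a busy index the output can be huge, hence the head):

| walklex index=main type=term
| head 50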

Additionally, if you create indexed fields, they are actually stored in the same index as the tokens parsed out of the raw event, just with the field name as a prefix. So if you create an indexed field called "mytestfield" with a value of "value1", it will be saved alongside the tokens as "mytestfield::value1". As an interesting bit of trivia - indexed fields are indistinguishable from key::value tokens parsed out of the raw data.
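You can spot those field::value entries sitting right next to the ordinary tokens with the same walklex command, for example something along these lines (again assuming a version that has walklex; the exact output field names may differ slightly between versions):

| walklex index=main type=fieldvalue
| search term="mytestfield::*"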

So indeed, if you're creating indexed fields, you cause the index to grow. There is no simple linear dependency though since the growth depends on the cardinality of the field, the size of the field values (the size of the name itself too), and the number of events to which the index entry has to point.
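If you're wondering how much a candidate indexed field would add, a quick search-time check of its cardinality at least tells you how many distinct lexicon entries it would create (index, sourcetype and field names here are placeholders):

index=main sourcetype=my_sourcetype
| stats dc(user) AS distinct_values count AS events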

Additionally, Splunk stores some simple summarizing csv files (which are relatively negligible in size) as well as a bloom filter, which is a kind of simplified index containing just the tokens, without the associated pointers. It might seem a bit redundant, but it's actually pretty useful - Splunk can determine whether a term is worth looking for in a bucket at all without processing the full index, which might be way bigger. So Splunk can simply skip searching a particular bucket if it knows it won't find anything there.

So, long story short - it's relatively complicated and there is no simple formula to give you a sure estimate of how your data will grow if you - for example - add a single indexed field. The rule of thumb is that the "core" indexed data (the raw events along with essential metadata fields) comes to about 15% of the original size of the raw data, and the indexes add another 35% of the original size. But that's only a generalized estimate. There's no way to reliably calculate it beforehand since many factors come into play.
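If you'd rather measure your existing data than rely on the rule of thumb, you can compare the uncompressed raw size that dbinspect reports for each bucket with the bucket's size on disk. A rough sketch (as far as I remember, rawSize is in bytes and sizeOnDiskMB in megabytes; hot buckets will skew the numbers a bit):

| dbinspect index=main
| stats sum(rawSize) AS raw_bytes sum(sizeOnDiskMB) AS disk_mb
| eval raw_mb=round(raw_bytes/1024/1024,2)
| eval disk_to_raw_ratio=round(disk_mb/raw_mb,2)

It won't break the growth down into journal vs. index files, but it gives you the overall on-disk to raw ratio for an index.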
