Getting Data In

Splunk disk size much larger than original data set. Why?

SunDance
Explorer

Hello,
We are having some trouble understanding the disk size requirements for storing our data set in Splunk.

The data set:
JSON files compressed in zip archives.
Each JSON file contains entries like this: {"n":1360799988083479,"v":40118552},...
where n is a microsecond-precision UTC timestamp and v is a value (mostly doubles, longs, and ints, but it can also be a string).
We currently have around 2 TB of such compressed zip archives.

Splunk side:
The structure of the data allows splitting it into around 100 classes. Each class corresponds to sources of information whose n/v pairs may arrive at quite different frequencies, so we decided to have one Splunk index per class.

We did a little investigation with one class. We inserted data with the default Splunk configuration and found that the Splunk index files (.tsidx) were far larger than we expected and dominated our storage space. We then inserted the same data with NO SEGMENTATION and found the .tsidx files to be much smaller (by an order of magnitude). We understand that NO SEGMENTATION gives the most space-efficient size on disk at the expense of losing full-text search capabilities. We are fine with this, since our data set is fairly structured and we would not use that feature anyway.

Although NO SEGMENTATION lowers our Splunk disk size estimate to 12 TB, we still feel that 12 TB is too large for the original 2 TB of compressed data (see details below).

Why is the data in Splunk expanding so much for our data set?
Is this normal with Splunk?
Is there anything more we can do to optimize this disk space?

Cheers,
Alex

Our disk space estimate is:
500 MB of compressed data expands to 1.81 GB in Splunk => a factor of about 3.6, which we round up to 4.
=> our whole data set of 2 TB will be 8 TB.
Adding another replica of the data (which we estimate at 4 TB, because the journal.gz files seem to be 2x the size of the initial compressed zips) leads to a total of 12 TB.
So we go from 2 TB to 12 TB.
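
For anyone who wants to rerun this estimate with their own numbers, the arithmetic above can be sketched in a few lines of Python. The function name and parameters are made up for illustration; the only inputs are the figures quoted in the post, and the "one extra copy at 2x the zip size" model is the poster's own assumption.

```python
# Figures quoted in the post above.
SAMPLE_COMPRESSED_GB = 0.5   # 500 MB of zipped JSON
SAMPLE_IN_SPLUNK_GB = 1.81   # same sample once indexed with no segmentation
TOTAL_COMPRESSED_TB = 2.0    # size of the whole zipped data set


def estimate_disk_tb(total_tb, expansion, journal_ratio, extra_copies):
    """Return (primary, replicas, total) in TB.

    expansion     -- indexed size / compressed source size for the primary copy
    journal_ratio -- journal.gz size / compressed source size for each replica
    extra_copies  -- number of additional copies (the post assumes 1)
    """
    primary = total_tb * expansion
    replicas = total_tb * journal_ratio * extra_copies
    return primary, replicas, primary + replicas


# Measured expansion factor for the sample (about 3.6).
measured_factor = SAMPLE_IN_SPLUNK_GB / SAMPLE_COMPRESSED_GB

# The post rounds the factor up to 4 and assumes journal.gz is 2x the zips.
primary, replicas, total = estimate_disk_tb(TOTAL_COMPRESSED_TB, 4, 2, 1)
print(f"measured factor: {measured_factor:.2f}")
print(f"primary: {primary} TB, replicas: {replicas} TB, total: {total} TB")
```

Passing `measured_factor` instead of the rounded 4 gives a slightly lower total, which shows how sensitive the estimate is to the single 500 MB sample.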

Drainy
Champion

I think the problem here is that, unlike a lot of other users, you're moving from compressed files to more compressed files, so you may be seeing the size inflate.

My guess is that the metadata added to the rawdata is causing a lot of this inflation. Splunk appends the usual metadata (host, source, sourcetype and so on) to each event in the .gz file, which can cause it to inflate. Normally users would see a reduction of around 40-50% in data size due to the compression Splunk applies, but in your case it's decompressing your data, appending additional metadata and then storing it in a compressed archive.

In fact, thinking about it, each event is so small that the metadata is definitely longer than the original event, so you would probably experience this kind of inflation. Each event is effectively more than doubling in length before compression.
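
To get a feel for that claim, here is a rough Python sketch that compares the length of one event from the question with the length of some per-event metadata. The host, source and sourcetype values are hypothetical, and treating the metadata as plain key/value strings is an assumption for illustration only, not a description of how Splunk actually lays out its journal.

```python
def metadata_overhead(event: str, metadata: dict) -> float:
    """Return metadata bytes divided by event bytes (before any compression)."""
    meta_bytes = sum(
        len(key.encode("utf-8")) + len(value.encode("utf-8"))
        for key, value in metadata.items()
    )
    return meta_bytes / len(event.encode("utf-8"))


# One entry exactly as shown in the question.
event = '{"n":1360799988083479,"v":40118552}'

# Hypothetical metadata values, invented for this example.
metadata = {
    "host": "collector01",
    "source": "/data/archive/class_042/2013-02-week07.json",
    "sourcetype": "nv_json",
}

ratio = metadata_overhead(event, metadata)
print(f"metadata is {ratio:.1f}x the size of the event itself")
```

With a long source path like the one above, the ratio comes out well over 1, which is consistent with the event more than doubling in length.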

Could you perhaps change the event breaking and search over multi-line events instead? Probably not, but I thought I'd throw it out there.

SunDance
Explorer

Thanks for your comments.
Drainy is right, the metadata dominates. Each event is small, especially compared to "source".
Does the metadata count towards what Splunk sees as the input rate (against the daily quota)?
You suggest that searching over multi-line events decreases the size on disk because the metadata is added once for multiple events instead of once per event. Is this the only way? I am wondering what the downsides of doing this are for querying. Would you have to split the events into time intervals such as 5m or 1h?
Right now each archive file contains 1 week of data. Is there a way to break the week down into smaller intervals at index time?

martin_mueller
SplunkTrust

On top of that metadata, Splunk is appending to the compressed file, so the compression ratio must suffer. I've tested recompressing a journal file, and its size easily drops by 10% without changing the content.
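
The general effect is easy to reproduce outside Splunk. The Python sketch below compresses the same synthetic events once as a single gzip stream and once as many small, independently compressed chunks. It only illustrates why compressing in pieces tends to cost space; it is not a model of how Splunk actually writes journal.gz, and the event values are generated for the example.

```python
import gzip
import json


def make_events(count):
    """Build synthetic n/v lines shaped like the entries in the question."""
    base = 1360799988083479
    return [
        json.dumps({"n": base + i * 1000, "v": 40118552 + i},
                   separators=(",", ":")).encode("utf-8") + b"\n"
        for i in range(count)
    ]


def single_stream_size(events):
    """Compress everything in one go."""
    return len(gzip.compress(b"".join(events)))


def chunked_stream_size(events, chunk_size):
    """Compress each chunk on its own and add up the sizes."""
    total = 0
    for i in range(0, len(events), chunk_size):
        total += len(gzip.compress(b"".join(events[i:i + chunk_size])))
    return total


events = make_events(5000)
whole = single_stream_size(events)
chunked = chunked_stream_size(events, chunk_size=50)
print(f"single stream: {whole} bytes, 50-event chunks: {chunked} bytes")
```

Each independent chunk pays for its own gzip header and starts with an empty dictionary, so the chunked total comes out larger than the single stream.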

kristian_kolb
Ultra Champion

I haven't played around with segmentation, but I'd have thought that the zipped source files would be roughly similar in size to the journal.gz, i.e. about a 1:1 ratio.

Could it have to do with the size/speed trade-off between compression algorithms? I'd understand why Splunk wouldn't sacrifice too much speed when creating the journal.gz, but still...

Interesting!

/k
