Getting Data In

How is the Index Size greater than the uncompressed raw data size?

Deepali529
Explorer

Uploaded File size: 717MB
Current Index size: 811MB ( settings -> Data -> Indexes )
Index Size: 0.79 GB ( Monitoring Console -> Indexing -> Indexes and Volumes -> Index Detail Instance -> overview )

I expected the index size to be less than the file size, but it is larger than the uploaded file.
Previously, when I uploaded the same file combined with another 14 MB file, the index size was 706 MB, whereas now it is the opposite. The data should have been compressed.

Can anybody please explain this?

Thanks and Regards

0 Karma
1 Solution

lguinn2
Legend

The size of the index on disk depends on several factors. It is entirely possible for the index to consume more space than the incoming file does. When Splunk indexes a file, it creates one or more buckets in the index. Each bucket contains two main kinds of files:

  • "rawdata" = the incoming data, plus timestamp, host, source and sourcetype, stored in a journaled, compressed file. The "rawdata" is compressed via gzip, so it generally equals about 15% of the inbound data size. However, this depends on how well the incoming data compresses.

  • index files = the keyword index, the bloom filters, metadata files, and various other index files. The size of these files is highly dependent on the number of unique keywords in the incoming data; indexed field extractions also increase the size of the index files. The size of these files can vary widely, but generally falls between 10% - 110% of the incoming data size.

If the size of the index changed between uploads, perhaps someone created indexed field extractions.
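As a rough illustration using those percentages: for a 717 MB upload, the rawdata might compress to roughly 108 MB (about 15%), while the index files could land anywhere from roughly 72 MB to 790 MB, so a total on disk around 811 MB is entirely plausible. If you want to see the breakdown yourself, a search along these lines should show it (a sketch; "yourindex" is a placeholder, and the rawSize and sizeOnDiskMB fields come from dbinspect):

  | dbinspect index=yourindex | stats sum(rawSize) as raw_bytes, sum(sizeOnDiskMB) as disk_mb

rawSize is the uncompressed size of the raw events in each bucket and sizeOnDiskMB is the total bucket size on disk, so comparing the two shows how much of the footprint comes from the index files rather than from the compressed rawdata.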

richgalloway
SplunkTrust
SplunkTrust

Did you clean out the index between uploads? If not, the index now contains multiple copies of the uploaded file which might explain why it's bigger than the source.
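One quick way to check (a sketch; "yourindex" and the source value are placeholders for your own values) is to count events per source and see whether there are more events than the file actually contains:

  index=yourindex | stats count by source

If the count for your file is a multiple of what you expect, the same data has been indexed more than once.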

---
If this reply helps you, Karma would be appreciated.

rjthibod
Champion

Per @richgalloway, please clarify what exactly you did to the index between uploads. For example, using the delete SPL command does not actually remove the data from the index.
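As a sketch of the difference (the index name and source path are placeholders): a search such as

  index=yourindex source="/path/to/file" | delete

only marks the matching events as unsearchable; the disk space is not reclaimed, so the index size stays the same. Actually removing the data requires something like the CLI command "splunk clean eventdata -index yourindex" (run with Splunk stopped), which wipes the index entirely.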

0 Karma

Deepali529
Explorer

Hi, I just uploaded the file the way we usually do. I did not use any command.

0 Karma

somesoni2
SplunkTrust
SplunkTrust

Try uploading it to a new index and compare. It is possible that some data was left over from your previous uploads, so uploading to a new index will ensure that doesn't happen. You can confirm the new index is genuinely empty before the upload; see the sketch below.
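Something like this should work (the index name is a placeholder):

  | dbinspect index=new_test_index | stats sum(sizeOnDiskMB) as disk_mb, sum(eventCount) as events

An empty index has no buckets, so both values should come back blank or zero; running the same search after the upload then gives a size measurement that cannot include leftovers from earlier tests.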

0 Karma

Deepali529
Explorer

Hi, I deleted the previous index and cleaned the system, then tried uploading the file again. It shows the correct word count of "3465010" on the Linux box as well as in Splunk.
But the index size is 945 MB and the file size is 731 MB.
I am not able to understand how this can be possible.

0 Karma

puneethgowda
Communicator

What is the file type, CSV or a text file?

0 Karma

Deepali529
Explorer

Hi, there is only one file present in the index.

File size: 717 MB
Index size: 811 MB

0 Karma