Hi,
I have indexed 6GB of CSV data in Splunk. When I look at the compression rate using this search:
| dbinspect index=sca_rs_index2
| fields state,id,rawSize,sizeOnDiskMB
| stats sum(rawSize) AS rawTotal, sum(sizeOnDiskMB) AS diskTotalinMB
| eval rawTotalinMB=(rawTotal / 1024 / 1024) | fields - rawTotal
| eval compression=tostring(round(diskTotalinMB / rawTotalinMB * 100, 2)) + "%"
| table rawTotalinMB, diskTotalinMB, compression
It gives me 500%.
Indeed, when I execute du -h /opt/splunk/var/lib/splunk/sca_rs_index2, I get 25GB.
The CSV files contain mainly float data, so I can understand that the compression would not be great, but I did not expect the index to be five times larger than the raw data. Here is a sample line from the indexed files:
156,Jun-25-2015 03:53:56:765 PM (CEST),4.46,4.24,-33.79,-36.85,2.9007191883900015E-9,2.9265995391399987E-9,7.07,12.803,2.184,2.128,1.7805797178279998E-9,1.8405463108899995E-9,-0.8554831,-1.4905629,-0.23367512,-0.04267813,-0.85057795,-1.4899217,-0.24520056,-0.03800104,2.9247010642799994E-9,2.91267178384E-9,1.3790839E-7,1.36082E-7,0.38341087,97,156,2.9231726E-9,0.02,2.8500933E-9,2.996252E-9,2.760363E-9,3.0509275E-9,2.6058726E-9,3.1849554E-9,1.3411523E-9,2.2089566E-9,-1.0E-8,1.9596868E-7,-1.0E-8,1.959884E-7,,,-0.009,0.55262375,-0.9074077,-0.8010149,-1.5971898,-1.379061,-0.5492179,0.022210669,-0.3409165,0.23051207,-0.90580165,-0.79919446,-1.5967877,-1.3784232,-0.5492179,0.022210669,-0.3409165,0.23051207,-35.85059,-31.78982,-39.128872,-34.6483,-2.9167533,7.0832467,-2.962398,7.037602,3.2690098,2.9473321,1.4245242,7.4673486,12.490728347227515
I have 5 accelerated reports on this data. There is a subfolder summary in /opt/splunk/var/lib/splunk/sca_rs_index2. Does it contain the summary information?
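For reference, a per-subdirectory breakdown should show how much of the 25GB the acceleration summaries account for, something like this (assuming the same path as above):
du -sh /opt/splunk/var/lib/splunk/sca_rs_index2/*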
What could be improved? How can I investigate this further?
Thanks,
As explained in the question, the inbound data size is 6GB and the space occupied on disk is 25GB, so the ratio between the two is around 500% as well.
Replacing INDEXED_EXTRACTIONS = CSV with a DELIMS-based extraction in transforms.conf significantly improved the situation:
- inbound data size: 6GB
- space occupied on disk: 8.5GB
- ratio of disk space to inbound data: 140%
- ratio between index size and compressed raw data, as returned by the search above: 180%
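For anyone interested, the change was roughly the following (the stanza and field names below are placeholders, not my actual configuration):
props.conf:
[my_csv_sourcetype]
# INDEXED_EXTRACTIONS = CSV   (removed: this writes every field into the index files)
REPORT-csv_fields = my_csv_delims

transforms.conf:
[my_csv_delims]
DELIMS = ","
FIELDS = "id","timestamp","value1","value2"

In other words, the fields are now extracted at search time instead of being stored as indexed fields.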
First, for almost every typical input - Windows event logs, Linux logs, etc. - the majority of the bucket is the keyword index, not the raw data. Assume that you have the following:
- incoming data of 100 MB
- raw data in the bucket averaging 15% of the incoming data (15 MB)
- index files in the bucket averaging 35% of the incoming data (35 MB)
This is the common example given for "syslog-like" data. If you compare the rawdata size to the index file size in this example, you get 35 / 15, roughly 233% - and that is the same comparison that your search does.
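If you want to see that arithmetic in one place, here is a quick makeresults sketch; the 15% and 35% figures are the rule-of-thumb averages above, not measurements from your index:
| makeresults
| eval inboundMB=100
| eval rawdataMB=inboundMB*0.15, indexFilesMB=inboundMB*0.35
| eval bucketMB=rawdataMB+indexFilesMB
| eval index_vs_rawdata=tostring(round(indexFilesMB/rawdataMB*100,0)) + "%"
| eval bucket_vs_inbound=tostring(round(bucketMB/inboundMB*100,0)) + "%"
| table inboundMB rawdataMB indexFilesMB bucketMB index_vs_rawdata bucket_vs_inbound
The second ratio (bucket vs. inbound, 50% in this example) is the one the documentation usually quotes; the first one (233%) is what your search computes.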
Your data is likely to generate much larger index files because of its contents, as you noted; there seems to be little commonality between "keywords." And using indexed fields will make the index files larger yet. So 500% seems totally realistic to me.
Remember that the documentation, course materials, etc. do not compare the compressed rawdata size of the bucket with the index file size of the bucket - they normally compare the inbound data size with the total bucket size.
So the real compression rate that you should care about is: given X amount of inbound data, how much disk space (rawdata and index files) will Splunk need?
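One way to measure that, assuming your _internal retention still covers the period when the data was indexed, is to pull the inbound volume for the index from the license usage log:
index=_internal source=*license_usage.log* type="Usage" idx="sca_rs_index2"
| stats sum(b) AS inboundBytes
| eval inboundMB=round(inboundBytes/1024/1024,2)
and then divide the sum of sizeOnDiskMB from your dbinspect search by that number.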
BTW, if I compress the index folder with tar, the size is 1.5GB.
Hmm... I discovered that INDEXED_EXTRACTIONS = CSV was set for this sourcetype. I need to investigate this.
That setting can be very bad and will explode your index size.