I can search for compression settings information all day long, but currently we only compress at 34% overall (per Firebrigade). That seems like a small number. When I search on indexes individually, I do see that compression is higher for some than for others.
I am using this query to get an overall picture of the compression percentage across my indexes.
index=summary orig_index=*
| rename orig_index AS index
| dedup host, path
| search state="warm"
| chart sum(rawSize) AS rawBytes, sum(sizeOnDiskMB) AS diskTotalinMB by index
| eval rawTotalinMB=round(rawBytes / 1024 / 1024, 0)
| eval comp_percent=tostring(round(diskTotalinMB / rawTotalinMB * 100, 2)) + "%"
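(For anyone who wants to sanity-check the arithmetic outside Splunk, here is a minimal Python sketch that mirrors the eval chain above - the sizes are made-up example values, not figures from my environment:)

def compression_percent(raw_bytes, disk_mb):
    # Mirror the eval chain: raw bytes -> MB, then disk / raw * 100
    raw_mb = round(raw_bytes / 1024 / 1024)
    return round(disk_mb / raw_mb * 100, 2)

# Example: 10 GB of raw events occupying 3.4 GB on disk -> 34.0%
print(compression_percent(10 * 1024 ** 3, 3.4 * 1024))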
I am hoping someone can answer the million-dollar question: how do I tune this setting so that it reduces my storage costs without hampering indexing or search speed?
Thanks,
Daniel
Well, I think you kind of answered this yourself - the current setting is already at the "sweet spot" that balances efficiency and performance. 34% actually sounds pretty good to me. Remember that the figure you're getting covers not just the compressed raw data, but also the corresponding metadata that Splunk needs in order to make the data searchable. The compression of the raw data itself is standard gzip, and typical figures for compressed vs. uncompressed raw data are around 10% (YMMV depending on data entropy). The metadata actually takes up more storage. There are ways of limiting what metadata Splunk stores, but all of these will greatly impact your search performance in the end. My advice would be to just leave the settings as they are.
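If you want a feel for the raw-data part of that figure in isolation, here is a minimal sketch using Python's zlib (the same DEFLATE family gzip uses). Note the sample text is artificially repetitive, so it will compress far better than real logs would - the point is just to show how much the ratio depends on data entropy:

import zlib

# Artificially repetitive log-like sample; real logs have more entropy and compress worse.
line = '127.0.0.1 - - [10/Oct/2015:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326\n'
raw = (line * 10000).encode()

compressed = zlib.compress(raw, 6)  # level 6 is zlib's default speed/size trade-off
ratio = 100.0 * len(compressed) / len(raw)
print("raw: %d bytes, compressed: %d bytes (%.1f%%)" % (len(raw), len(compressed), ratio))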
I believe Splunk uses the standard zlib library. There is a Python script you can play with, but I strongly recommend against it. My guess is they set the compression level to strike a balance between speed and space (see the sketch after the paths below).
root:/var/root # locate zlib|grep splunk
/opt/splunk/6.3/lib/python2.7/encodings/zlib_codec.py
/opt/splunk/6.3/lib/python2.7/lib-dynload/zlib.so
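To illustrate the generic speed-vs-space trade-off, here is a quick sketch timing zlib at a few compression levels. This is just standard-library behaviour - I am not claiming to know which level Splunk actually uses internally:

import time
import zlib

line = '127.0.0.1 - - [10/Oct/2015:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326\n'
data = (line * 50000).encode()

# Higher levels squeeze out more bytes but burn more CPU per compress call.
for level in (1, 6, 9):
    start = time.time()
    out = zlib.compress(data, level)
    elapsed = time.time() - start
    print("level %d: %d bytes, %.1f ms" % (level, len(out), elapsed * 1000))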
Thanks Ayn, something tells me you have done this before. So much of this comes down to gzip's compression scheme, and I guess nothing much is going to change that, nor the back-end logic Splunk is using here. OK, time to think more about hardware now 🙂