What is the best practise HDFS File Compression fo...

duncangoff · ‎08-11-2017

Given the option of compressing files using the above technologies which would be the best practise for use with Splunk Analytics for Hadoop (Hunk)?

From my own research I'm leaning towards snappy for searches, lzma for space with gzip and bzip2 in the middle space with bzip2 being generally better than gzip.

I have also seen others recommending LZO, although i'm not sure if this is an option for this specific case.

Thoughts?

shaskell_splunk · ‎08-11-2017

There are lots of tradeoffs and factors to consider here. Splunk Analytics for Hadoop will work with many popular compression codecs utilizing commons-compress-1.10.jar (Splunk 6.6.2)

Here is a list of codecs:
bzip2, gzip, pack200, lzma, xz, Snappy, traditional Unix Compress, DEFLATE, LZ4, Brotli and ar, cpio, jar, tar, zip, dump, 7z, arj
Commons Compress 1.10 On Maven

With that said, I'd suggest you consult with your Hadoop vendor or experiment to see what gives you the best performance for the given compression ratio. One recommendation I am comfortable giving would be to pick a compression codec that is natively splittable in Hadoop like bzip2, Snappy or LZO. I've seen performance issues on non-splittable compression codecs like gzip.

Here's a link from Cloudera on data compression performance for your reference:
https://www.cloudera.com/documentation/enterprise/5-3-x/topics/admin_data_compression_performance.ht...

rdagan_splunk · ‎08-11-2017

Since Hunk submit the job to Hadoop for processing the same Hadoop recommendation you get from Cloudera, Hortonworks, or MapR also applies here.
As you highlighted, there is a tradeoff between speed and compression rate, and Snappy seems like the favorite codec.
However, before you select the compression codec, I highly recommend you select the right File format first (Text, Avro, Parquet, ORC) and only then decide about the Codec (note that not all formats support all compression options):
https://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet

sloshburch · ‎08-11-2017

The only Best Practices around this feature appear to be documented at http://docs.splunk.com/Documentation/Splunk/latest/HadoopAnalytics/Performancebestpractices.

I'll see if others with more experience have expressed benefits of one compression tech over another.

What is the best practise HDFS File Compression for use with Splunk Analytics For Hadoop given a choice of GZIP, BZIP2, LZMA and Snappy?

Why You Can't Miss .conf25: Unleashing the Power of Agentic AI with Splunk & Cisco

Deep Dive into Federated Analytics: Unlocking the Full Power of Your Security Data

Your summer travels continue with new course releases

Are you a member of the Splunk Community?

What is the best practise HDFS File Compression for use with Splunk Analytics For Hadoop given a choice of GZIP, BZIP2, LZMA and Snappy?

Why You Can't Miss .conf25: Unleashing the Power of Agentic AI with Splunk & Cisco

Deep Dive into Federated Analytics: Unlocking the Full Power of Your Security Data

Your summer travels continue with new course releases