The following settings have no effect in Hunk, since Hunk compresses mapper output with gzip (this is not configurable):
vix.mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec
vix.mapreduce.output.fileoutputformat.compress.type = BLOCK
vix.mapreduce.output.fileoutputformat.compress = true
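Where these properties do take effect is in an ordinary MapReduce job that writes the data in the first place (i.e., outside Hunk). A sketch of the job-side equivalents, assuming standard Hadoop 2.x property names and that the Snappy native libraries are installed on the cluster:

```
# Hypothetical job configuration for an upstream MapReduce job
# (not Hunk) that writes Snappy-compressed output to HDFS.
mapreduce.output.fileoutputformat.compress = true
mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec
mapreduce.output.fileoutputformat.compress.type = BLOCK
```

Hunk would then read that Snappy-compressed data as its raw input; only the intermediate mapper output it produces itself is forced to gzip.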
So using a date range and verbose mode, it takes about 10 minutes to process 94 files x 300 MB per file.
Any particular reason you're using verbose mode? That mode is intended primarily for exploratory work; it is extremely expensive for reporting searches and introduces quite a bit of overhead.
If these need to be compressed on HDFS I assume LZO or Snappy?
Yes, there are usually performance benefits to using compressed raw data - I'd recommend Snappy, as it generally has better read-throughput performance.
What types of searches are you trying to run on the data?