Monitoring Splunk

I would like to use compression when submitting the MR2 jobs. How do I configure the virtual index provider?

techdiverdown
Path Finder

Cloudera CDH 5.02 and Hunk 6.2 I believe, or whatever is the latest version of both. Anyway, I was trying to turn on snappy compression, which i did from Cloudera, but there are several compression settings that should be pushed from the job level. So here are the configs in the Virtual Index Provider, please verify they are correct. So far I see no performance benefit.

vix.mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec
vix.mapreduce.output. fileoutputformat. compress.type = BLOC
vix.mapreduce.output.fileoutputformat. compress = true

Are these correct and if so, why am I seeing no performance difference? I am processing netflow data, each file is about 300 MB for 15 minutes of netflow data. So using a date range and verbose mode, it takes about 10 minutes to process 94 files X 300MB per file. Note the netflow data is not compresses in HDFS. If these need to be compressed on HDFS I assume LZO or Snappy?

Thanks.

0 Karma

Ledion_Bitincka
Splunk Employee
Splunk Employee

The following settings have no effect on Hunk as we use gzip (not configurable) to compress the mapper output results.

vix.mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec 
vix.mapreduce.output. fileoutputformat.compress.type = BLOC     
vix.mapreduce.output.fileoutputformat.compress = true

So using a date range and verbose
mode, it takes about 10 minutes to
process 94 files X 300MB per file.

Any particular reason that you're using verbose mode? This search mode is used primarily for exploratory needs and it is extremely expensive for reporting searches and introduces quite a bit of over head.

If these need to be compressed on HDFS I assume LZO or Snappy?

Yes, there's usually performance benefits from using compressed raw data - I'd recommend Snappy as it generally has better read throughput performance.

What types of searches are you trying to run on the data?

0 Karma
Career Survey
First 500 qualified respondents will receive a $20 gift card! Tell us about your professional Splunk journey.

Can’t make it to .conf25? Join us online!

Get Updates on the Splunk Community!

Community Content Calendar, September edition

Welcome to another insightful post from our Community Content Calendar! We're thrilled to continue bringing ...

Splunkbase Unveils New App Listing Management Public Preview

Splunkbase Unveils New App Listing Management Public PreviewWe're thrilled to announce the public preview of ...

Leveraging Automated Threat Analysis Across the Splunk Ecosystem

Are you leveraging automation to its fullest potential in your threat detection strategy?Our upcoming Security ...