Hunk job gets OutOfMemoryError

techdiverdown · ‎06-25-2014

I have an existing virtual index with some data and it works fine. I decided to compress the data with Snappy and moved the compressed data to another directory in HDFS. I then created a new virtual index to read the compressed data, and the search job fails with the following:

06-25-2014 10:03:22.810 INFO ERP.psb_cloudera - Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
06-25-2014 10:03:22.811 INFO ERP.psb_cloudera - at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:123)

Please advise how to increase the memory in Hunk. I assume the setting to use is vix.env.HADOOP_HEAPSIZE; I raised it from 512 to 1024 and I am still getting the same error. I could increase it further, but I am wondering why more memory is needed at all, since compression and decompression are mostly CPU-bound.

My original files were named like 2014-05-24-16-00.01.csv; the new files are named like 2014-05-24-16-00.01.csv.snappy. The size was about 300 MB per file; now the size is approximately 60-80 MB per file.

By the way, the job seems to die within a few seconds even if I remove all but one of the files in the directory.

**ADDITIONAL INFO**
If I use gzip compression, everything works fine. I will try bzip2 and lzo as well. I believe I want a splittable compression format for file storage in HDFS, as described at this link:

http://comphadoop.weebly.com/index.html

Ledion_Bitincka · ‎06-25-2014

Can you please provide the (scrubbed) contents of search.log as well as indexes.conf?

Also can you test to see if the following command throws the same error:

hadoop fs -text hdfs://host:port/path/to/file.snappy

Also, what version of Hadoop and Snappy libraries are you using?

bosburn_splunk · ‎06-29-2014

Can you open a ticket up and email bosburn@splunk.com the ticket number?

Ledion_Bitincka · ‎06-27-2014

You can either email them to support or maybe post them on pastebin and provide a link

techdiverdown · ‎06-27-2014

Snappy lib 1.0.2, using python-snappy from GitHub.
$ hadoop version
Hadoop 2.3.0-cdh5.0.2
Subversion git://github.sf.cloudera.com/CDH/cdh.git -r 8e266e052e423af592871e2dfe09d54c03f6a0e8
Compiled by jenkins on 2014-06-09T16:20Z
Compiled with protoc 2.5.0
From source with checksum 75596fe27f833e512f27fbdaaa7b0ab
This command was run using /usr/lib/hadoop/hadoop-common-2.3.0-cdh5.0.2.jar

techdiverdown · ‎06-27-2014

Dumb question - How do I upload these logs? I cannot paste them into this window.

techdiverdown · ‎06-27-2014

The above command works fine.

hadoop fs -ls hdfs://cloudera-node0:8020/user/netflow/2014-05-24-17-30-01.csv.snappy
14/06/27 15:02:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rwxr-xr-x 3 netflow netflow 67108864 2014-06-27 14:54 hdfs://cloudera-node0:8020/user/netflow/2014-05-24-17-30-01.csv.snappy
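
One thing that may be worth checking (this is a guess, not something confirmed in this thread): the stack trace ends in BlockDecompressorStream.getCompressedData, which, as far as I understand Hadoop's block codec layout, reads a 4-byte big-endian length from the file and allocates a buffer of that size. If the files were written in a different Snappy container, for example the Snappy framing format that python-snappy's stream compressor can produce, those header bytes are not a length at all, and reading them as one could request a buffer far larger than any reasonable heap. That would also fit the job dying within seconds regardless of HADOOP_HEAPSIZE. Below is a minimal Python sketch that interprets the first bytes of a file both ways; the function names are hypothetical, and the Hadoop header layout is an assumption on my part.

```python
import struct

# Stream identifier chunk that opens a file in the Snappy framing format.
SNAPPY_FRAMING_MAGIC = b"\xff\x06\x00\x00sNaPpY"


def inspect_snappy_header(header: bytes) -> dict:
    """Interpret the first bytes of a .snappy file in two ways."""
    if len(header) < 8:
        raise ValueError("need at least 8 bytes of header")
    # Assumed Hadoop block codec layout: two big-endian signed 32-bit ints,
    # the uncompressed block size followed by the compressed chunk length.
    raw_len, chunk_len = struct.unpack(">ii", header[:8])
    return {
        "framing_format": header.startswith(SNAPPY_FRAMING_MAGIC),
        "hadoop_raw_len": raw_len,
        "hadoop_chunk_len": chunk_len,
    }


def inspect_file(path: str) -> dict:
    """Read the first 10 bytes of a local copy of the file and inspect them."""
    with open(path, "rb") as fh:
        return inspect_snappy_header(fh.read(10))
```

To try it, copy one file out of HDFS with hadoop fs -get and call inspect_file on the local copy. If framing_format comes back True, the value in hadoop_chunk_len is what a block-format reader would treat as a chunk length; for the framing magic that works out to roughly 1.9 GB, which no 1024 MB heap could satisfy. In that case the fix would be to rewrite the files in the format Hadoop's SnappyCodec expects rather than to add memory.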
