Hunk Job Gets OutOfMemoryError
I have an existing virtual index with some data, and it works fine. I decided to compress the data with Snappy and moved the compressed data to another directory in HDFS. I then created a new virtual index to read the compressed data, and I get the following error:
06-25-2014 10:03:22.810 INFO ERP.psb_cloudera - Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
06-25-2014 10:03:22.811 INFO ERP.psb_cloudera - at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:123)
Please advise how to increase the memory in Hunk; I assume vix.env.HADOOP_HEAPSIZE is the right setting. I pushed it from 512 to 1024 and am still getting the same error. I could increase it further, but I am wondering why more memory should be needed at all, since compression/decompression is mostly CPU cycles.
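For reference, here is a minimal sketch of where that setting lives, assuming a provider stanza in indexes.conf. The provider name is taken from the ERP.psb_cloudera log prefix above; the paths are assumptions for a CDH5 node, not values from this thread:

[provider:psb_cloudera]
vix.family = hadoop
# Assumed locations on a CDH5 host
vix.env.JAVA_HOME = /usr/lib/jvm/java-7-oracle
vix.env.HADOOP_HOME = /usr/lib/hadoop
vix.fs.default.name = hdfs://cloudera-node0:8020
# Heap size in MB for the Hadoop client processes Hunk launches;
# raised here from the default 512 while troubleshooting the OutOfMemoryError
vix.env.HADOOP_HEAPSIZE = 1024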
My original files were named like 2014-05-24-16-00.01.csv; the new files are 2014-05-24-16-00.01.csv.snappy. The size was about 300 MB per file; now it is approximately 60-80 MB per file.
By the way, the job seems to die within a few seconds even if I remove all but one of the files in the directory.
**ADDITIONAL INFO**
If I use gzip compression, everything works fine. I will try bzip2 and LZO as well. I believe I want a splittable compression format for file storage in HDFS, as can be seen from this link:
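On the splittability point: of the codecs mentioned, bzip2 is splittable on its own, gzip is not, and snappy is generally only splittable when used inside a container format such as a SequenceFile (LZO needs a separate index). As a quick sanity check of which codecs the Hadoop client can actually load natively, and assuming your build ships the checknative tool, you could run:

hadoop checknative -a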

Can you please provide the (scrubbed) contents of search.log as well as indexes.conf?
Also, can you test whether the following command throws the same error:
hadoop fs -text hdfs://host:port/path/to/file.snappy
Also, what version of Hadoop and Snappy libraries are you using?
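A lower-tech check as well, offered as a suggestion beyond the original reply: peek at the first bytes of the file, since different snappy writers lay the stream out differently on disk, and compare against a .snappy file that Hadoop itself wrote:

hadoop fs -cat hdfs://host:port/path/to/file.snappy | head -c 32 | od -c

If the two byte patterns differ, the codec and the file were probably produced by incompatible tools.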

Can you open a support ticket and email the ticket number to bosburn@splunk.com?

You can either email them to support or post them on Pastebin and provide a link.
Snappy lib 1.0.2, using python-snappy from GitHub.
$ hadoop version
Hadoop 2.3.0-cdh5.0.2
Subversion git://github.sf.cloudera.com/CDH/cdh.git -r 8e266e052e423af592871e2dfe09d54c03f6a0e8
Compiled by jenkins on 2014-06-09T16:20Z
Compiled with protoc 2.5.0
From source with checksum 75596fe27f833e512f27fbdaaa7b0ab
This command was run using /usr/lib/hadoop/hadoop-common-2.3.0-cdh5.0.2.jar
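A possible explanation, offered as an assumption rather than a confirmed diagnosis: python-snappy does not write the block-stream layout that Hadoop's SnappyCodec expects (the BlockDecompressorStream in the stack trace reads length-prefixed blocks), and a misread length field could make the decompressor try to allocate an enormous buffer, which would fail within seconds no matter how large the heap is. One way to produce snappy files in the layout Hadoop expects is to let Hadoop write them itself, for example with a pass-through streaming job; the jar location and the input/output paths below are assumptions for CDH5:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    -D mapreduce.job.reduces=0 \
    -input /user/netflow/csv \
    -output /user/netflow/csv-snappy \
    -mapper cat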
Dumb question - How do I upload these logs? I cannot paste them into this window.
The above command works fine.
hadoop fs -ls hdfs://cloudera-node0:8020/user/netflow/2014-05-24-17-30-01.csv.snappy
14/06/27 15:02:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rwxr-xr-x 3 netflow netflow 67108864 2014-06-27 14:54 hdfs://cloudera-node0:8020/user/netflow/2014-05-24-17-30-01.csv.snappy
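One detail worth flagging in that output: the NativeCodeLoader warning means this client did not load the native Hadoop library, and Hadoop's snappy codec relies on native code. A hedged check, assuming standard CDH5 paths:

# Does the native library directory exist and contain libhadoop / libsnappy?
ls /usr/lib/hadoop/lib/native/
# If it does, one way to point a client shell at it:
export JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native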
