I have an existing virtual index with some data and it works fine. I decided to compress the data with snappy and I moved this data to another directory in HDFS. I then created a new virtual index to read the compressed data and I get the following:
06-25-2014 10:03:22.810 INFO ERP.psb_cloudera - Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
06-25-2014 10:03:22.811 INFO ERP.psb_cloudera - at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:123)
Please advise how to increase the memory in Hunk. I assume the setting to use is vix.env.HADOOP_HEAPSIZE; I raised it from 512 to 1024 and still get the same error. I could increase it further, but I am wondering why more memory is needed at all if compression/decompression is mostly CPU-bound.
My original files were named like 2014-05-24-16-00.01.csv; the new files are named like 2014-05-24-16-00.01.csv.snappy. The size was about 300 MB per file; now it is approximately 60-80 MB per file.
By the way, the job seems to die within a few seconds even if I remove all but one of the files in the directory.
If I use gzip compression, everything works fine. I will try bzip2 and lzo as well. I believe I want a splittable compression format for file storage in HDFS, as can be seen from this link:
Can you please provide the (scrubbed) contents of search.log as well as indexes.conf?
Also can you test to see if the following command throws the same error:
hadoop fs -text hdfs://host:port/path/to/file.snappy
Also, what version of Hadoop and Snappy libraries are you using?
Snappy lib 1.0.2, using python-snappy from GitHub.
Subversion git://github.sf.cloudera.com/CDH/cdh.git -r 8e266e052e423af592871e2dfe09d54c03f6a0e8
Compiled by jenkins on 2014-06-09T16:20Z
Compiled with protoc 2.5.0
From source with checksum 75596fe27f833e512f27fbdaaa7b0ab
This command was run using /usr/lib/hadoop/hadoop-common-2.3.0-cdh5.0.2.jar
The above command works fine.
hadoop fs -ls hdfs://cloudera-node0:8020/user/netflow/2014-05-24-17-30-01.csv.snappy
14/06/27 15:02:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rwxr-xr-x 3 netflow netflow 67108864 2014-06-27 14:54 hdfs://cloudera-node0:8020/user/netflow/2014-05-24-17-30-01.csv.snappy
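Since the stack trace ends in BlockDecompressorStream.getCompressedData, one thing that may be worth checking is whether the file starts with the length fields that reader expects. The sketch below is only an illustration under an assumption: that the Hadoop block-compressed layout begins with a 4-byte big-endian uncompressed block size followed by a 4-byte big-endian compressed chunk length. The function names and the 256 MB sanity limit are hypothetical, not part of Hadoop or Hunk. It takes the first 8 bytes of a file (for example, saved locally from the output of hadoop fs -cat) and reports how they would be read as lengths.

```python
import struct


def peek_block_header(first_bytes):
    """Read the first 8 bytes as two signed big-endian 32-bit integers.

    Assumption: these are (uncompressed block size, compressed chunk length)
    in the layout a block decompressor stream would expect.
    """
    if len(first_bytes) < 8:
        raise ValueError("need at least 8 bytes")
    uncompressed_size, chunk_len = struct.unpack(">ii", first_bytes[:8])
    return uncompressed_size, chunk_len


def looks_plausible(first_bytes, file_size, max_block=256 * 1024 * 1024):
    """Rough sanity check; max_block is an arbitrary, hypothetical limit."""
    uncompressed_size, chunk_len = peek_block_header(first_bytes)
    return 0 < uncompressed_size <= max_block and 0 < chunk_len <= file_size


if __name__ == "__main__":
    # Synthetic header: 256 KB block, 1000-byte compressed chunk.
    good = struct.pack(">ii", 262144, 1000)
    print(peek_block_header(good), looks_plausible(good, 67108864))
```

If the second number comes out far larger than the file itself, a reader that allocates a buffer of that size could run out of heap no matter how high HADOOP_HEAPSIZE is set, which might explain an OutOfMemoryError on a 64 MB file. That would suggest a file-format mismatch rather than a genuine memory shortage, but it is only a hypothesis to test.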