Seems to require adding the streaming jar to the Hadoop classpath
In order to use this jar with hadoop streaming on a standalone installation of Cloudera CDH3 on Ubuntu Linux, I had to do two things:
- change the ant file to add:
    +++ b/build.xml
    @@ -21,6 +21,8 @@
    +  <fileset dir="${hadoop.home}"
    +           includes="contrib/streaming/hadoop-streaming-*.jar" />
- add the selected hadoop streaming jar to the HADOOP_CLASSPATH.
In dumbo/backends/streaming.py, I added:
    if addedopts['libjarstreaming'] and addedopts['libjarstreaming'][0] != 'no':
        addedopts['libjar'].append(streamingjar)
which seemed to be required to get it to work.
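To make the effect of that change easier to see outside of dumbo, here is a minimal standalone sketch of the same check. It assumes addedopts behaves like a dict mapping option names to lists of values, which is how the snippet above uses it. The function names add_streaming_jar and classpath_from_libjars, and the jar names in the usage lines, are hypothetical and not part of dumbo. The second helper only illustrates how a list of jars might end up in a HADOOP_CLASSPATH-style string; I have not checked that dumbo builds it this way.

```python
import os


def add_streaming_jar(addedopts, streamingjar):
    """Append streamingjar to addedopts['libjar'] when 'libjarstreaming'
    is set and its first value is not 'no' (same test as the patch)."""
    flag = addedopts.get('libjarstreaming')
    if flag and flag[0] != 'no':
        addedopts.setdefault('libjar', []).append(streamingjar)
    return addedopts


def classpath_from_libjars(libjars, existing=''):
    """Hypothetical helper: join jars onto an existing classpath string,
    skipping entries that are already present."""
    parts = [p for p in existing.split(os.pathsep) if p]
    for jar in libjars:
        if jar not in parts:
            parts.append(jar)
    return os.pathsep.join(parts)


# Usage with made-up jar names:
opts = {'libjarstreaming': ['yes'], 'libjar': ['feathers.jar']}
add_streaming_jar(opts, 'hadoop-streaming.jar')
print(classpath_from_libjars(opts['libjar']))
```

With 'libjarstreaming' set to 'no' or left empty, the libjar list is left unchanged, so the streaming jar is only added when explicitly requested.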
Without this, I always got an error saying that it couldn't find org.apache.hadoop.typedbytes.TypedBytesWritable for the partition function:
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/typedbytes/TypedBytesWritable
at fm.last.feathers.partition.Prefix.
After that, I was able to use the partition/Prefix class successfully.
Thanks for the tip!