Cloud9
Cloud9 is a Hadoop toolkit for working with big data.
Optimizations to support indexing English Gigaword 5th ed (10M docs).
- Increased language support for Wikipedia for the top 24 languages by number of articles
- Added disambiguation patterns for each of the 24 supported languages
- ExtractWikipediaDisambiguations lets you extract...
Added some error checks for parsing Wikipedia dumps and English Wikipedia pages.
With this change, one should be able to process a bzip2 file directly. Let me know if you have any comments.
Very minor change on line 211 to make sure AWS EMR doesn't throw errors.
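As a rough illustration of what processing a bzip2-compressed dump directly can look like, here is a minimal Python sketch that streams a compressed XML file line by line without unpacking it to disk first. It is not Cloud9's code (Cloud9 runs on Hadoop); the function name `iter_titles` and the assumption that each `<title>` element sits on a single line are hypothetical choices made only for this example.

```python
import bz2
import re

# Assumption for this sketch: each <title>...</title> element fits on one line.
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def iter_titles(path):
    """Yield page titles from a bzip2-compressed XML dump, one line at a time."""
    # bz2.open in text mode decompresses incrementally as lines are read,
    # so the full dump is never held in memory or written out uncompressed.
    with bz2.open(path, mode="rt", encoding="utf-8") as stream:
        for line in stream:
            match = TITLE_RE.search(line)
            if match:
                yield match.group(1)
```

Calling `list(iter_titles("dump.xml.bz2"))` on a small compressed file would collect its titles; for a full dump, iterating lazily keeps memory use flat.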