dkpro-c4corpus
DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
This fixes the three issues mentioned above: - #27 - Allowing deeply nested documents to be processed, as well as speeding up processing in general. Rather than continually backtracking to...
The project has been released with groupId `org.dkpro.c4corpus` but still uses the old package hierarchy, i.e. `de.tudarmstadt.ukp.dkpro.c4corpus`. This should be fixed. It causes confusion when referring to classes...
It would be helpful if the boilerplate remover command line also accepted directories as input arguments.
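One possible shape for this, as a minimal sketch only: expand each input argument into a list of files before processing. The class name `InputExpander` and the method `expand` are hypothetical and not part of the project; how the existing command line would call it is left open.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of dkpro-c4corpus.
public class InputExpander {

    /**
     * Returns the files to process: the path itself if it is a regular file,
     * or every regular file found recursively under it if it is a directory.
     */
    public static List<Path> expand(Path input) {
        if (Files.isRegularFile(input)) {
            return Collections.singletonList(input);
        }
        // Files.walk must be closed, hence try-with-resources.
        try (Stream<Path> walk = Files.walk(input)) {
            return walk.filter(Files::isRegularFile)
                       .sorted() // stable order so runs are reproducible
                       .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The caller would then loop over the returned list and run the existing single-file logic on each entry.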
The differences between the Java and Python implementations were explained as largely an artifact of different XML parsers in a reply to #23, but I think there's more to it...
The output from the boilerplate processor, e.g. /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt, appears to use a character encoding other than UTF-8. This causes strings such as Epogen® and “A-thal” to be corrupted.
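The cause is not confirmed here. A common source of this kind of corruption in Java is writing files with the platform default charset, so the sketch below assumes that and shows output written with an explicit UTF-8 charset. The class `Utf8Output` and its methods are hypothetical, not the project's code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of dkpro-c4corpus.
public class Utf8Output {

    /** Writes text as UTF-8 regardless of the platform default charset. */
    public static void write(Path target, String text) {
        try (Writer w = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole file back, decoding it as UTF-8. */
    public static String read(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the input HTML is being decoded with the wrong charset earlier in the pipeline, fixing the writer alone would not help; that would need checking separately.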
The conditional here is wrong: https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-boilerplate/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/boilerplate/impl/JusTextBoilerplateRemoval.java#L350 causing the algorithm to attempt to reclassify non-headings, not just headings. The inverted conditionals just to save a little indentation whitespace make my head...
The text normalization in [Utils.normalize()](https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-boilerplate/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/boilerplate/impl/Utils.java#L117) seems pretty heavy-handed for something which is irreversible and non-optional. Additionally, it's not computationally expensive, so it can be done easily by downstream...
Comparing these two files: - /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt - /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Python_Defaults_CleanEvalHTMLTestSubset/105.txt It appears that the Python program is dropping `&nbsp;` entities, but not decoding some others such as `&lt;`. The gold standard doesn't...
Attempting to process this segment: s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-27/segments/1435375093899.18/warc/CC-MAIN-20150627031813-00201-ip-10-179-60-89.ec2.internal.warc.gz stalls between 7k and 8k records when it encounters a deeply nested tag structure that triggers the O(n!) complexity, in tree depth, of Paragraph.getPath(Node). The...
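For illustration, a node's path can be built in a single upward walk, which is linear in tree depth and uses no recursion. This is a sketch under assumptions, not the project's `Paragraph.getPath(Node)`: `SimpleNode` is a hypothetical stand-in for the parser's node type, and the dot-separated path format is assumed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not the project's implementation.
public class NodePath {

    /** Minimal stand-in for the HTML parser's node type. */
    public static final class SimpleNode {
        final String name;
        final SimpleNode parent; // null for the root

        public SimpleNode(String name, SimpleNode parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /** Builds a root-to-node path such as "html.body.div" in one pass. */
    public static String pathOf(SimpleNode node) {
        Deque<String> names = new ArrayDeque<>();
        // Walk from the node up to the root, prepending each name once.
        for (SimpleNode n = node; n != null; n = n.parent) {
            names.addFirst(n.name);
        }
        return String.join(".", names);
    }
}
```

Because the loop is iterative, a very deep chain of nested tags costs one step per level and cannot overflow the call stack.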
Currently we rely on Hadoop 2.6.0 which is present in AWS EMR 4.2.0. We should update to Hadoop 2.7.1 to keep up with the latest EMR version (4.4.0).