dkpro-c4corpus issues

Fix O(n!) in tag depth issue

3

This fixes the three issues mentioned above: - #27 - Allowing deeply nested documents to be processed, as well as speeding up processing in general. Rather than continually backtracking to...

tfmorris

inconsistent package hierarchy and groupId

1

The project has been released with groupId `org.dkpro.c4corpus` but still uses the old package hierarchy, i.e. `de.tudarmstadt.ukp.dkpro.c4corpus`. This should be fixed. It causes confusion when referring to classes...

maxxkia

passing directory as argument for boilerplate remover

It would be helpful if the boilerplate remover command line also accepted directories as input arguments.

maxxkia

enhancement
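
A minimal sketch of how directory input could be handled, assuming the tool only needs a flat list of files to process. The class name `InputCollector`, the method `collectInputs`, and the `.html` filter are hypothetical and not taken from the project; the actual command-line entry point may be organized differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of dkpro-c4corpus.
final class InputCollector {

    /**
     * Returns the files to process: the path itself if it is a regular file,
     * or every *.html file found recursively if it is a directory.
     */
    static List<Path> collectInputs(Path input) throws IOException {
        if (Files.isRegularFile(input)) {
            return Collections.singletonList(input);
        }
        // Files.walk must be closed, hence try-with-resources.
        try (Stream<Path> walk = Files.walk(input)) {
            return walk
                    .filter(Files::isRegularFile)
                    // Assumption: only files ending in .html are of interest.
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".html"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

With a helper like this, the existing single-file code path could stay unchanged and simply be called once per returned path.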

Make Java JusText implementation match Python and/or document differences

4

The differences between the Java and Python implementations were explained as largely an artifact of different XML parsers in a reply to #23, but I think there's more to it...

tfmorris

enhancement

Character encoding issues in boilerplate processing

2

The output from the boilerplate processor, e.g. /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt, appears to use a character encoding other than UTF-8. This causes strings such as Epogen® and “A-thal” to be corrupted.

tfmorris
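
The issue does not state the cause. One common source of this kind of corruption in Java is reading or writing text with the platform default charset. The sketch below shows output written and read back with an explicit UTF-8 charset; `Utf8Output`, `writeUtf8`, and `readUtf8` are hypothetical names, not the project's API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of dkpro-c4corpus.
final class Utf8Output {

    /** Writes text as UTF-8 regardless of the platform default charset. */
    static void writeUtf8(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    /** Reads a file back, decoding its bytes as UTF-8. */
    static String readUtf8(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }
}
```

Passing the charset explicitly at every read and write boundary keeps results independent of the JVM's `file.encoding` setting.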

Boilerplate removal header post processing incorrect

The conditional here is wrong: https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-boilerplate/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/boilerplate/impl/JusTextBoilerplateRemoval.java#L350 causing the algorithm to attempt to reclassify non-headings, not just headings. The inverted conditionals just to save a little indentation whitespace make my head...

tfmorris

Text normalization too aggressive?

1

The text normalization in [Utils.normalize()](https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-boilerplate/src/main/java/de/tudarmstadt/ukp/dkpro/c4corpus/boilerplate/impl/Utils.java#L117) seems pretty heavy-handed for something which is irreversible and non-optional. Additionally, it's not computationally expensive, so it can be done easily by downstream...

tfmorris

enhancement

HTML entities not decoded

3

Comparing these two files: - /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt - /dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Python_Defaults_CleanEvalHTMLTestSubset/105.txt It appears that the Python program is dropping `&nbsp;` entities, but not decoding some others such as `&lt;`. The gold standard doesn't...

tfmorris

bug

O(n!) processing in tag name/path for Paragraph in dedupe code

2

Attempts to process this segment: s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-27/segments/1435375093899.18/warc/CC-MAIN-20150627031813-00201-ip-10-179-60-89.ec2.internal.warc.gz stalls between 7k-8k records when it encounters a deeply nested tag structure that triggers the O(n!) complexity in tree depth processing of Paragraph.getPath(Node). The...

tfmorris

enhancement
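
As an illustration of how a tag path can be built in time linear in the nesting depth, here is a minimal sketch that walks the ancestor chain once, iteratively. `TagPath`, `SimpleNode`, and `pathOf` are hypothetical stand-ins: this is not the project's `Paragraph.getPath(Node)`, and the real DOM node type and path format may differ.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not the dkpro-c4corpus implementation.
final class TagPath {

    /** Minimal tree node standing in for whatever DOM node type the real code uses. */
    static final class SimpleNode {
        final String name;
        final SimpleNode parent;

        SimpleNode(String name, SimpleNode parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /**
     * Builds a root-to-node path such as "html.body.div" by visiting each
     * ancestor exactly once, without recursion or repeated re-traversal.
     */
    static String pathOf(SimpleNode node) {
        Deque<String> names = new ArrayDeque<>();
        for (SimpleNode n = node; n != null; n = n.parent) {
            names.addFirst(n.name); // prepend so the root ends up first
        }
        return String.join(".", names);
    }
}
```

Because the loop is iterative, very deep nesting also does not risk a stack overflow.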

Update Hadoop to 2.7.1 to keep up with latest AWS EMR version

Currently we rely on Hadoop 2.6.0 which is present in AWS EMR 4.2.0. We should update to Hadoop 2.7.1 to keep up with the latest EMR version (4.4.0).

habernal

enhancement
