
ucto creates invalid folia

Open kosloot opened this issue 6 years ago • 2 comments

Given the attached file issue77.xml.txt, ucto creates invalid FoLiA: UIT.xml.txt. The command was: ucto --passthru issue77.xml UIT.xml

>foliavalidator UIT.xml 
VALIDATION ERROR on full parse by library (stage 2/3), in UIT.xml
ParseError: FoLiA exception in handling of <s> @ line 47 (in parent <p> @ parent line 44) : [DeclarationError] Processor ucto.1 is used for annotationtype SENTENCE, set None, but has no corresponding <annotator> referring to it from the annotations declaration block!
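
For readers unfamiliar with this error, the rule being violated can be illustrated with a small Python sketch. It is not how foliavalidator or libfolia implement the check: it assumes the FoLiA v2 layout (declarations such as `<sentence-annotation>` holding `<annotator processor="..."/>` children), ignores sets and defaults entirely, and uses a hypothetical minimal document rather than the attached files. The name `undeclared_processors` is made up for this example.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"

# Structure elements and the declaration each one needs (a subset, for illustration).
DECLARATION_FOR = {
    "p": "paragraph-annotation",
    "s": "sentence-annotation",
    "w": "token-annotation",
}


def undeclared_processors(folia_xml):
    """Return sorted (tag, processor) pairs used in the text but not declared."""
    root = ET.fromstring(folia_xml)

    # Collect every (declaration name, processor id) pair from the declarations.
    declared = set()
    for decl in root.iter():
        name = decl.tag.split("}")[-1]
        if name.endswith("-annotation"):
            for annotator in decl.findall(f"{{{FOLIA_NS}}}annotator"):
                declared.add((name, annotator.get("processor")))

    # Compare against the processors actually referenced by <p>, <s> and <w>.
    missing = set()
    for tag, decl_name in DECLARATION_FOR.items():
        for elem in root.iter(f"{{{FOLIA_NS}}}{tag}"):
            proc = elem.get("processor")
            if proc is not None and (decl_name, proc) not in declared:
                missing.add((tag, proc))
    return sorted(missing)


# Hypothetical minimal document: <s> uses ucto.1, but only "other.1" is declared.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0">
  <metadata>
    <annotations>
      <sentence-annotation>
        <annotator processor="other.1"/>
      </sentence-annotation>
    </annotations>
  </metadata>
  <text>
    <p><s processor="ucto.1"><t>Hello</t></s></p>
  </text>
</FoLiA>"""

print(undeclared_processors(SAMPLE))  # [('s', 'ucto.1')]
```

A real check should still go through foliavalidator, which also takes sets, defaults and the provenance block into account.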

SIDENOTE: folialint doesn't complain about this file; reported separately as https://github.com/LanguageMachines/libfolia/issues/42

issue77.xml.txt UIT.xml.txt

kosloot, Jan 27 '20 16:01

I think there are several issues here.

  1. When using passthru, it may not be correct that ucto tries to assign a Sentence and Words to the second paragraph. @proycon what should --passthru do here? The documentation states: "Don't tokenize, but perform input decoding and simple token role detection".
  2. But a similar problem arises when we use ucto -Lnld issue77.xml UIT.xml: in that case ucto creates a new sentence with processor ucto.1 but uses the old sentence-annotation from the input. It should add an extra sentence-annotation referring to ucto.1. If the answer to 1. is 'OK, just add a sentence and a word', then the same would hold when using the "passthru" set.
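
One reading of "add an extra sentence-annotation referring to ucto.1" is that the existing declaration gains an additional `<annotator>` for the new processor. The Python sketch below only illustrates that change to the declaration block; it is not ucto's (C++) implementation, the helper name `add_annotator` and the sample header are hypothetical, and sets are again ignored.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
ET.register_namespace("", FOLIA_NS)  # keep the default namespace when serializing


def add_annotator(root, declaration, processor):
    """Attach <annotator processor=...> to an existing declaration, at most once."""
    decl = root.find(f".//{{{FOLIA_NS}}}{declaration}")
    if decl is None:
        raise ValueError(f"no <{declaration}> declaration in this document")
    tag = f"{{{FOLIA_NS}}}annotator"
    # Only add the annotator if this processor is not referenced yet.
    if all(a.get("processor") != processor for a in decl.findall(tag)):
        ET.SubElement(decl, tag, {"processor": processor})
    return decl


# Hypothetical header: the input already declares sentences for another processor.
HEADER = """<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0">
  <metadata>
    <annotations>
      <sentence-annotation>
        <annotator processor="other.1"/>
      </sentence-annotation>
    </annotations>
  </metadata>
</FoLiA>"""

root = ET.fromstring(HEADER)
decl = add_annotator(root, "sentence-annotation", "ucto.1")
print([a.get("processor") for a in decl])  # ['other.1', 'ucto.1']
```

Calling the helper a second time with the same processor leaves the declaration unchanged, so it is safe to run on documents that were already processed.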

kosloot, Jan 28 '20 14:01

Point 2 is (for now) resolved by 'adopting' the annotations already present in the input. This produces correct FoLiA, but the question remains whether this is the best solution.

Maybe we should reject such input. But there are use cases where annotations are defined (and sometimes NOT used at all). We could also make ucto assign its own segmentation set in such cases, but this has some troublesome consequences too.

For now I suggest sticking with this half-baked solution, but I am a bit worried about it.

kosloot, Feb 05 '20 11:02