Jesse de Does
Dear John, You are right, the plain text output from ALTO is execrable. The reason is that the conversion takes place indirectly: ALTO --> tokenized TEI with zoning --> plain text....
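For illustration only, here is a minimal sketch of what a direct ALTO-to-text extraction could look like, skipping the intermediate tokenized TEI step. It is not how OpenConvert works. The function name `alto_to_text` is hypothetical, and the sketch assumes the usual ALTO layout in which each `TextLine` holds `String` elements with a `CONTENT` attribute. Hyphenation (`HYP`) and reading order across blocks are ignored.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the sketch works across ALTO versions."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def alto_to_text(alto_xml):
    """Hypothetical helper: one output line per ALTO TextLine.

    Assumes words are stored in String/@CONTENT; SP and HYP elements
    are ignored and words are joined with a single space.
    """
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            words = [
                s.get("CONTENT", "")
                for s in elem.iter()
                if local_name(s.tag) == "String"
            ]
            lines.append(" ".join(words))
    return "\n".join(lines)
```

Calling `alto_to_text` on the contents of an ALTO file returns the text line by line, which can serve as a baseline when judging the output of the indirect route.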
I have added content in the lib directory. Please let me know if you have any problems!
Hello all, sorry to catch up only today - The right command line for conversion from txt to TEI is (txt not text) java -jar OpenConvert.jar -from txt -to TEI...
Thanks both!! I can install @PonteIneptique's version. I run into cuda issues later on, but that is most likely a problem of my local machine.
Thanks again! (My machine does have cuda, but it magically gets mixed up on system updates from time to time)
First the easy ones: - We fixed the validation issue found by Tomaz in one of the files - We removed the resp statement for linguistic annotation from the annotated...
- missing text: this has to do with text paragraphs which could not automatically be classified in the first step of the conversion from HTML to TEI. In the first...
* Using common taxonomies. We have tried to do this as much as possible now. When categories we need are missing from the common taxonomy, we add a -BE file...
Multiple speaker types indeed break the validation: ``` Error: Type error on line 332 column 49 of parlamint-lib.xsl: XTTE0780 A sequence of more than one item is not allowed as...
Summarizing: - Gap becomes note for the unclassified paragraphs. Some way to characterize this content would be welcome. Maybe allow `subtype="problematic_content"` or something along those lines? - We removed some...