extraction-framework "Bad IRI" and "Illegal character in IRI" across latest-core collection.

The latest-core collection at https://databus.dbpedia.org/dbpedia/collections/latest-core as downloaded on January 28, 2022 has many "Bad IRI" and "Illegal character in IRI" issues across the data as reported by Apache Jena's riot --validate command. For example:

article-templates_lang=en.ttl.bz2 : 474.07 sec : 50,428,351 Triples : 106,372.54 per second : 0 errors : 28,718 warnings

It would be more robust to ensure the published triples pass all syntax checks.

References:

https://jena.apache.org/documentation/io/

Jan 28 '22 16:01 donpellegrino

@donpellegrino I am transferring this issue to https://github.com/dbpedia/extraction-framework/issues

Jan 29 '22 08:01 kurzum

Hi @donpellegrino,

"It would be more robust to ensure the published triples pass all syntax checks."

There is a lot of variation in this, and there is no such thing as "all" syntax checks. About a year ago, we built this parser: https://github.com/dbpedia/databus-derive, which uses Jena 3.13.1. It is highly parallelized and should be one of the fastest parsers out there, if not the fastest. It also does more than parsing: it writes quite detailed parse logs and logs all malformed triples.

We also publish the parselogs here: http://dbpedia-mappings.tib.eu/parse-reports/generic/article-templates/
Back then we filed a bug report about a warning in Jena, and they updated their parser in version 3.13.1 specifically for us.

I think https://github.com/dbpedia/databus-derive/blob/master/src/main/java/org/dbpedia/databus/derive/io/rdf/NoErrorProfile.java is the exact parser profile we use to configure Jena.

I looked at http://dbpedia-mappings.tib.eu/parse-reports/generic/article-templates/2021.09.01/article-templates_lang=en_debug.txt.bz2, and it seems that we need to update ICU, the Unicode library. Most of the problems are caused by new emojis.

Also, the result of riot --validate depends heavily on the Jena version you are using. I tested with rapper/libraptor and found no errors in 2021.12.01:

rapper -i ntriples article-templates_lang\=en.ttl -c 
rapper: Parsing URI file:///home/kurzum/Downloads/article-templates_lang=en.ttl with parser ntriples
rapper: Parsing returned 50428351 triples

Looking at "0 errors : 28,718 warnings", this seems to be the NFKC-related Unicode warning that was fixed in Jena. @donpellegrino, could you post the Jena version and, if possible, more detailed information?
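
To tell an NFKC normalization warning apart from other IRI complaints, one can test the IRI string directly. A minimal sketch using Python's standard unicodedata module (Python 3.8+ for is_normalized); the sample IRIs and the helper name nfkc_report are illustrative assumptions, not taken from the dump:

```python
import unicodedata


def nfkc_report(iri: str) -> dict:
    """Report whether an IRI string is already in NFKC form (hypothetical helper)."""
    return {
        "is_nfkc": unicodedata.is_normalized("NFKC", iri),
        "normalized": unicodedata.normalize("NFKC", iri),
    }


# Made-up sample IRIs: the second contains U+FB01 (LATIN SMALL LIGATURE FI),
# which NFKC rewrites to the two letters "fi".
samples = [
    "http://example.org/resource/Berlin",
    "http://example.org/resource/\ufb01sh",
]

for iri in samples:
    report = nfkc_report(iri)
    print(ascii(iri), report["is_nfkc"], ascii(report["normalized"]))
```

An IRI that is flagged here would point at normalization; one that passes but still triggers a riot warning would point at a different check.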

@Vehnem parselogs after 09.2021 are missing: http://dbpedia-mappings.tib.eu/parse-reports/generic/article-templates/

Jan 29 '22 08:01 kurzum

I used Jena version 3.17.0:

> riot --version
Jena:       VERSION: 3.17.0
Jena:       BUILD_DATE: 2020-11-25T19:40:23+0000

I am not sure whether the Unicode interpretation comes from Jena directly or depends on the underlying Java implementation. For my original report, I was running it with Oracle Java 1.8.0_291-b10:

> java -version
java version "1.8.0_291"
Java(TM) SE Runtime Environment (build 1.8.0_291-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.291-b10, mixed mode)

The locale is UTF-8:

> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Switching to OpenJDK 11.0.13:

> java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-suse-3.65.1-x8664)
OpenJDK 64-Bit Server VM (build 11.0.13+8-suse-3.65.1-x8664, mixed mode)

OpenJDK 11.0.13 also gives the warnings:

> riot --validate --time article-templates_lang\=en.ttl.bz2
<snip>
09:02:11 WARN  riot            :: [line: 50422047, col: 35] Illegal character in IRI (Not a ucschar: 0xD834): <http://dbpedia.org/resource/𝅘𝅥[U+D834]...>
09:02:11 WARN  riot            :: [line: 50422047, col: 36] Illegal character in IRI (Not a ucschar: 0xDD72): <http://dbpedia.org/resource/𝅘𝅥?[U+DD72]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 31] Illegal character in IRI (Not a ucschar: 0xD834): <http://dbpedia.org/resource/[U+D834]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 32] Illegal character in IRI (Not a ucschar: 0xDDBA): <http://dbpedia.org/resource/?[U+DDBA]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 33] Illegal character in IRI (Not a ucschar: 0xD834): <http://dbpedia.org/resource/𝆺[U+D834]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 34] Illegal character in IRI (Not a ucschar: 0xDD65): <http://dbpedia.org/resource/𝆺?[U+DD65]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 35] Illegal character in IRI (Not a ucschar: 0xD834): <http://dbpedia.org/resource/𝆺𝅥[U+D834]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 36] Illegal character in IRI (Not a ucschar: 0xDD6F): <http://dbpedia.org/resource/𝆺𝅥?[U+DD6F]...>
article-templates_lang=en.ttl.bz2 : 593.94 sec : 50,428,351 Triples : 84,904.79 per second : 0 errors : 28,718 warnings
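
The values in these warnings (0xD834, 0xDD72, and so on) fall in the UTF-16 surrogate range, so each consecutive pair looks like the two halves of one supplementary-plane character rather than two illegal characters. A minimal Python sketch of that arithmetic, assuming the standard UTF-16 decoding formula and the ucschar ranges from RFC 3987; the helper names combine_surrogates and is_ucschar are made up for this illustration:

```python
# ucschar ranges as listed in RFC 3987, section 2.2
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # planes 1 through 13: %x10000-1FFFD ... %xD0000-DFFFD
    + [(plane << 16, (plane << 16) + 0xFFFD) for plane in range(1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def is_ucschar(code_point: int) -> bool:
    """Check a code point against the RFC 3987 ucschar ranges."""
    return any(lo <= code_point <= hi for lo, hi in UCSCHAR_RANGES)


# Pairs taken from the warnings above (lines 50422047 and 50422048).
pairs = [(0xD834, 0xDD72), (0xD834, 0xDDBA), (0xD834, 0xDD65), (0xD834, 0xDD6F)]

for high, low in pairs:
    cp = combine_surrogates(high, low)
    print(
        f"0x{high:04X} 0x{low:04X} -> U+{cp:05X} "
        f"ucschar={is_ucschar(cp)} half_alone={is_ucschar(high)}"
    )
```

Each combined code point lands in the %x10000-1FFFD range that ucschar allows, while a surrogate half checked on its own does not. If that reading is right, the warnings would come from validating UTF-16 code units one at a time instead of whole code points; this has not been checked against the Jena 3.17.0 source.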

Jan 31 '22 14:01 donpellegrino

Hi, I will check it this week. The issue seems valid. The RDF pruning/validation process seems to have failed (or was not working correctly).

Feb 07 '22 14:02 Vehnem