extraction-framework Illegal unicode char in extracted file

Line #66650918 of https://downloads.dbpedia.org/repo/dbpedia/wikidata/sameas-all-wikis/2020.03.01/sameas-all-wikis.ttl.bz2, with subject http://wikidata.dbpedia.org/resource/Q9398047, contains the Unicode character U+FFFC in the object. That character is not allowed in an IRI according to https://tools.ietf.org/html/rfc3987#section-2.2
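To illustrate why U+FFFC is rejected, here is a minimal sketch that transcribes the `ucschar` ranges from RFC 3987 section 2.2 and checks a single character against them. The names `UCSCHAR_RANGES` and `is_ucschar` are made up for this example. It covers only the non-ASCII `ucschar` rule, not the full IRI grammar.

```python
# Inclusive code point ranges of the `ucschar` rule in RFC 3987, section 2.2.
UCSCHAR_RANGES = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]


def is_ucschar(ch: str) -> bool:
    """Return True if the single character `ch` falls in a ucschar range.

    ASCII characters are governed by other grammar rules and therefore
    return False here; this helper is only about non-ASCII characters.
    """
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


if __name__ == "__main__":
    # U+FFFC lies above 0xFFEF, the upper bound of the third range.
    print(is_ucschar("\ufffc"))
    # U+0142 (Latin small letter l with stroke) is inside the first range.
    print(is_ucschar("\u0142"))
```

U+FFFC sits in the gap between U+FFEF and U+10000, which no `ucschar` range covers.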

By the way, this causes RDF4J parsing to fail fatally (see the ticket I opened there).

Jun 22 '20 08:06 elad-shaked

Thank you.

@Vehnem I can confirm <http://wikidata.dbpedia.org/resource/Q9398047> <http://www.w3.org/2002/07/owl#sameAs> <http://pl.dbpedia.org/resource/> . leads to a crash of our parser as well. Can you post the error message?

Rapper, the W3C validator, and the Jena IRI validator all seem to accept it. However, U+FFFC does not seem to be a legal IRI character.

Jun 22 '20 20:06 JJ-Author

I opened an issue at https://issues.apache.org/jira/browse/JENA-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Jun 25 '20 12:06 JJ-Author

This was fixed in Jena: https://issues.apache.org/jira/browse/JENA-1924?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

I will update the post-processing parser to the latest Apache Jena version so that these triples are deleted in the final release. Code: https://github.com/dbpedia/databus-derive

For testing: We will write a so-called Construct-Validation test that is able to check if the character appears in any part of the URIs, without applying a concrete RDF parser.
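A parser-independent check of this kind could look roughly like the sketch below. It is not the Construct-Validation test itself, only an illustration in plain Python. The names `IRI_TERM`, `FORBIDDEN`, and `find_bad_iris` are hypothetical, and the regex is a heuristic that treats anything between `<` and `>` as an IRI, so angle brackets inside literals would also be matched.

```python
import re

# Heuristic: anything between angle brackets on a line is treated as an IRI.
IRI_TERM = re.compile(r"<([^<>]*)>")

# Characters to flag; U+FFFC is the one reported in this issue.
FORBIDDEN = frozenset({"\ufffc"})


def find_bad_iris(lines, forbidden=FORBIDDEN):
    """Yield (line_number, iri) for every IRI containing a forbidden character.

    No RDF parser is involved; each line is scanned as plain text.
    """
    for lineno, line in enumerate(lines, start=1):
        for iri in IRI_TERM.findall(line):
            if any(ch in forbidden for ch in iri):
                yield lineno, iri


if __name__ == "__main__":
    # Made-up sample data, not taken from the actual dump.
    sample = [
        "<http://example.org/s> <http://example.org/p> <http://example.org/o> .",
        "<http://example.org/s> <http://example.org/p> <http://example.org/\ufffc> .",
    ]
    for lineno, iri in find_bad_iris(sample):
        print(lineno, repr(iri))
```

Because it reads lines lazily, the same function can be fed a streamed, decompressed file object (for example one opened with `bz2.open(path, "rt", encoding="utf-8")`) without loading the whole dump into memory.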

Nov 29 '21 11:11 Vehnem