Illegal unicode char in extracted file
Line #66650918 of https://downloads.dbpedia.org/repo/dbpedia/wikidata/sameas-all-wikis/2020.03.01/sameas-all-wikis.ttl.bz2 (subject http://wikidata.dbpedia.org/resource/Q9398047)
contains the Unicode character U+FFFC in its object IRI. That character is not allowed in an IRI according to https://tools.ietf.org/html/rfc3987#section-2.2
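To make the RFC reference concrete, here is a minimal Python sketch of the `ucschar` production from RFC 3987, section 2.2. The function name `is_ucschar` is made up for illustration; the ranges are copied from the RFC grammar. U+FFFC falls in the gap between U+FFEF and U+10000, so it is not a `ucschar`.

```python
# Code point ranges of the `ucschar` production in RFC 3987, section 2.2.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 to 13: %x10000-1FFFD up to %xD0000-DFFFD.
UCSCHAR_RANGES += [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)]
# Plane 14 starts at E1000, not E0000.
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))


def is_ucschar(ch: str) -> bool:
    """Return True if the single character `ch` matches `ucschar` (hypothetical helper)."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


print(is_ucschar("\ufffc"))  # U+FFFC, the character reported in this issue
print(is_ucschar("\u0142"))  # U+0142, a letter common in Polish resource names
```

This only covers the non-ASCII part of the grammar; ASCII characters and the `iprivate` ranges allowed in queries are handled by other productions.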
By the way, this causes parsing by RDF4J to fail fatally (see this ticket I opened there)
Thank you.
@Vehnem I can confirm
<http://wikidata.dbpedia.org/resource/Q9398047> <http://www.w3.org/2002/07/owl#sameAs> <http://pl.dbpedia.org/resource/> . leads to a crash of our parser as well. Can you post the error message?
Rapper, the W3C validator, and the Jena IRI validator all seem to accept it. However, U+FFFC does not seem to be a legal IRI character.
I opened an issue at https://issues.apache.org/jira/browse/JENA-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
This was fixed in Jena: https://issues.apache.org/jira/browse/JENA-1924?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel
I will update the post-processing parser to the latest Apache Jena version so that these triples are deleted in the final release. Code: https://github.com/dbpedia/databus-derive
For testing: We will write a so-called Construct-Validation test that is able to check if the character appears in any part of the URIs, without applying a concrete RDF parser.
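A parser-independent check of this kind could look roughly like the sketch below. This is not the Construct-Validation implementation, only an illustration in Python: the names `find_bad_iris`, `IRI_REF`, and `FORBIDDEN` are made up here, and the forbidden set contains only the character from this issue. It assumes one triple per line with IRIs written as `<...>`, which holds for the sameAs file; a literal that itself contains angle brackets would also be matched.

```python
import re

# Anything between angle brackets is treated as an IRI (assumption, see lead-in).
IRI_REF = re.compile(r"<([^<>]*)>")

# Assumption: only U+FFFC is checked; extend this set for other characters.
FORBIDDEN = {"\ufffc"}


def find_bad_iris(lines, forbidden=FORBIDDEN):
    """Yield (line_number, iri) for every IRI containing a forbidden character."""
    for line_no, line in enumerate(lines, start=1):
        for iri in IRI_REF.findall(line):
            if any(ch in forbidden for ch in iri):
                yield line_no, iri


# Hypothetical sample data; example.org IRIs stand in for the real triples.
sample = [
    "<http://example.org/s1> <http://www.w3.org/2002/07/owl#sameAs> <http://example.org/ok> .",
    "<http://example.org/s2> <http://www.w3.org/2002/07/owl#sameAs> <http://example.org/bad\ufffc> .",
]
for line_no, iri in find_bad_iris(sample):
    print(line_no, repr(iri))
```

For the real dump, the lines could be streamed with `bz2.open(path, "rt", encoding="utf-8")` from the Python standard library and passed to the same function, so no RDF parser is involved at any point.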