extraction-framework

Illegal unicode char in extracted file

Open elad-shaked opened this issue 5 years ago • 3 comments

Line 66650918 of https://downloads.dbpedia.org/repo/dbpedia/wikidata/sameas-all-wikis/2020.03.01/sameas-all-wikis.ttl.bz2, with subject http://wikidata.dbpedia.org/resource/Q9398047, contains the Unicode character U+FFFC (OBJECT REPLACEMENT CHARACTER) in its object. That makes the object an invalid IRI according to https://tools.ietf.org/html/rfc3987#section-2.2

By the way, this causes parsing by RDF4J to fail fatally (see this ticket I opened there)
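To make the RFC 3987 argument concrete, here is a minimal sketch that checks a single character against the `ucschar` ranges from section 2.2 of the RFC. The function name `is_ucschar` is hypothetical and not taken from any library. It deliberately ignores `iprivate`, which the grammar only allows in the query component.

```python
# ucschar ranges from RFC 3987, section 2.2 (BMP part).
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]

# Planes 1 to 13 each contribute xx0000-xxFFFD.
UCSCHAR_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]

# Plane 14 starts at E1000 rather than E0000.
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))


def is_ucschar(ch: str) -> bool:
    """Return True if the single character ch matches the ucschar rule."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


if __name__ == "__main__":
    for ch in ("\u00e9", "\ufffc"):
        print(f"U+{ord(ch):04X}", "ucschar" if is_ucschar(ch) else "not ucschar")
```

U+FFFC falls in the gap between U+FFEF and U+10000, so it is not covered by `ucschar`.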

elad-shaked, Jun 22 '20 08:06

Thank you.

@Vehnem I can confirm that the triple <http://wikidata.dbpedia.org/resource/Q9398047> <http://www.w3.org/2002/07/owl#sameAs> <http://pl.dbpedia.org/resource/> . crashes our parser as well. Can you post the error message?

Rapper, the W3C validator, and the Jena IRI validator all seem to accept it. However, U+FFFC does not appear to be legal in an IRI.

JJ-Author, Jun 22 '20 20:06

Opened an issue at https://issues.apache.org/jira/browse/JENA-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

JJ-Author, Jun 25 '20 12:06

This was fixed in Jena: https://issues.apache.org/jira/browse/JENA-1924?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

I will update the post-processing parser to the latest Apache Jena version so that these triples are removed from the final release. Code: https://github.com/dbpedia/databus-derive

For testing: We will write a so-called Construct-Validation test that is able to check if the character appears in any part of the URIs, without applying a concrete RDF parser.
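As a rough illustration of a parser-independent check (not the actual Construct-Validation test, which is not shown in this thread), the following sketch scans raw lines for IRI references in angle brackets and reports those containing U+FFFC. The names `find_suspect_iris` and `scan_bz2` are hypothetical. The regex is a heuristic: it would also match angle-bracketed text inside literals.

```python
import bz2
import re

# Heuristic: anything between '<' and '>' is treated as an IRI reference.
IRI_REF = re.compile(r"<([^<>]*)>")

# The character reported in this issue.
SUSPECT = "\ufffc"


def find_suspect_iris(lines, char=SUSPECT):
    """Yield (line_number, iri) for every IRI reference containing char."""
    for lineno, line in enumerate(lines, start=1):
        for match in IRI_REF.finditer(line):
            if char in match.group(1):
                yield lineno, match.group(1)


def scan_bz2(path, char=SUSPECT):
    """Stream a bz2-compressed dump line by line without an RDF parser."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        yield from find_suspect_iris(handle, char)
```

Calling `scan_bz2` on a local copy of the dump would stream through it without loading the whole file into memory. Line numbers are 1-based.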

Vehnem, Nov 29 '21 11:11