Some dbpedia files contain invalid literals with rdf:langString and empty language tag

Open desislava-hristova-ontotext opened this issue 6 years ago • 5 comments

The following DBpedia files (and probably more) contain invalid literals that are typed as rdf:langString but have no language tag:

https://downloads.dbpedia.org/repo/lts/generic/infobox-properties/2019.08.30/infobox-properties_lang%3den.ttl.bz2
https://downloads.dbpedia.org/repo/lts/generic/persondata/2019.08.30/persondata_lang%3den.ttl.bz2

See: https://www.w3.org/TR/rdf11-concepts/#dfn-language-tagged-string

None of these files can be loaded with RDF4J, which does not tolerate such literals and returns an error: "RDF Parse Error: datatype rdf:langString requires a language tag [line 1]"
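To make the problem concrete: in N-Triples a literal carries either a language tag (`"Jim"@en`) or a datatype (`"Jim"^^<...>`), never both, so any literal written with an explicit rdf:langString datatype necessarily lacks a language tag. A minimal Python sketch for locating such statements follows; the function name `find_invalid_langstrings` is hypothetical and not part of any DBpedia tooling.

```python
import re

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"

# A statement whose object literal is explicitly typed rdf:langString.
# The pattern is anchored at the end of the line, where the object sits.
_INVALID = re.compile(r'"\^\^<' + re.escape(RDF_LANGSTRING) + r'>\s*\.\s*$')


def find_invalid_langstrings(lines):
    """Yield (line_number, statement) for rdf:langString literals without a language tag."""
    for number, line in enumerate(lines, start=1):
        if _INVALID.search(line):
            yield number, line.rstrip("\n")


if __name__ == "__main__":
    sample = [
        '<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/name> '
        '"Jim Pewter"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .\n',
        '<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/name> "Jim Pewter"@en .\n',
    ]
    for number, statement in find_invalid_langstrings(sample):
        print(number, statement)
```

The sketch only inspects line-based N-Triples output like the files above; full Turtle with prefixes or multi-line statements would need a real parser.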

Hi @desislava-hristova-ontotext, the output is certainly not correct on a semantic level. We leave it open to the community to fix this (minor) extraction bug.

However, we will start a discussion on whether, in future releases, such triples should be filtered out of our parsed DBpedia release on the Databus and moved into a separate file of erroneous triples. Your help and input would be valuable to us. At the moment, parsing and triple validation are performed with Jena.

Jena as of 3.14 does not report an error.

➜ bin curl https://downloads.dbpedia.org/repo/lts/generic/persondata/2019.08.30/persondata_lang%3den.ttl.bz2 | lbzcat | riot --validate
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   290  100   290    0     0   1218      0 --:--:-- --:--:-- --:--:--  1223
➜  bin 

So if you think these triples should be excluded, please open an issue with Jena so that they can fix the parser.

Moreover, is it possible for you to ignore the warnings in RDF4J and still load the file? I know Stardog had a flag to disable strict parsing. Perhaps one also exists for GraphDB?

JJ-Author, Jan 24 '20 16:01

https://github.com/dbpedia/extraction-framework/blob/91577ca39df1bc4a6a6aab5fb88d0e0a069df816/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TripleFormatter.scala#L19

The else-if condition is too weak; it should be something like if ( dt == rdflangString && language ), so that the branch is taken only when a language is actually set.

Vehnem, Feb 20 '20 00:02

Finally, the missing language is handled incorrectly here: https://github.com/dbpedia/extraction-framework/blob/91577ca39df1bc4a6a6aab5fb88d0e0a069df816/core/src/main/scala/org/dbpedia/extraction/destinations/formatters/TerseBuilder.scala#L36 So I am still not sure where this literal is built.

Vehnem, Feb 20 '20 00:02
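The stricter check suggested above could look roughly like the following Python sketch. It is not the actual TripleFormatter or TerseBuilder code (which is Scala); `format_literal` and `escape` are hypothetical names, and falling back to a plain string literal when the language is missing is only one possible policy, assumed here for illustration. Rejecting the triple would be another.

```python
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def escape(value):
    """Escape backslashes, quotes and line breaks for an N-Triples string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def format_literal(value, datatype=None, language=None):
    """Render a literal, emitting a language tag only when a language is set."""
    quoted = '"' + escape(value) + '"'
    if language:
        # A language-tagged string is written with @lang and no explicit datatype.
        return quoted + "@" + language
    if datatype is None or datatype in (RDF_LANGSTRING, XSD_STRING):
        # Assumption: with no language available, rdf:langString is downgraded
        # to a simple literal (xsd:string in RDF 1.1) instead of being emitted.
        return quoted
    return quoted + "^^<" + datatype + ">"


if __name__ == "__main__":
    print(format_literal("Jim Pewter", RDF_LANGSTRING, None))
    print(format_literal("Jim Pewter", RDF_LANGSTRING, "en"))
```

In this sketch a set language always wins over the datatype, which mirrors the idea that rdf:langString is only meaningful together with a language tag.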

➜  20200220 git:(master) ✗ curl http://dbpedia-generic.tib.eu/release/generic/persondata/2019.07.01/persondata_lang\=en.ttl.bz2 | lbunzip2| grep langString | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   266  100   266    0     0   4030      0 --:--:-- --:--:-- --:--:--  4030
➜  20200220 git:(master) ✗ curl http://dbpedia-generic.tib.eu/release/generic/persondata/2019.08.01/persondata_lang\=en.ttl.bz2 | lbunzip2| grep langString | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   266  100   266    0     0   4666      0 --:--:-- --:--:-- --:--:--  4666
➜  20200220 git:(master) ✗ curl http://dbpedia-generic.tib.eu/release/generic/persondata/2019.08.30/persondata_lang\=en.ttl.bz2 | lbunzip2| grep langString | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   290  100   290    0     0   5087      0 --:--:-- --:--:-- --:--:--  5087
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/name> "Jim Pewter"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/surname> "Pewter"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/givenName> "Jim"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://purl.org/dc/elements/1.1/description> "American radio personality"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
➜  20200220 git:(master) ✗ curl http://dbpedia-generic.tib.eu/release/generic/persondata/2019.10.01/persondata_lang\=en.ttl.bz2 | lbunzip2| grep langString | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   337  100   337    0     0   5106      0 --:--:-- --:--:-- --:--:--  5106
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/name> "Jim Pewter"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/surname> "Pewter"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://xmlns.com/foaf/0.1/givenName> "Jim"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .
<http://dbpedia.org/resource/Jim_Pewter> <http://purl.org/dc/elements/1.1/description> "American radio personality"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString> .

The error has been present since version 2019.08.30 (the MARVIN extraction). Starting with that release, we included two preprocessing steps:

../run ResolveTransitiveLinks $EXTRACTIONBASEDIR redirects redirects_transitive .ttl.bz2 @downloaded   
../run MapObjectUris $EXTRACTIONBASEDIR redirects_transitive .ttl.bz2 disambiguations,infobox-properties,page-links,persondata,topical-concepts _redirected .ttl.bz2 @downloaded

https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config/blob/master/functions.sh#L67

Vehnem, Feb 20 '20 01:02

@Vehnem Can we add a test for this type of error? Or do we already have such a test?

m1ci, May 15 '20 10:05
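One way such a test might be approached is a regression check that streams a compressed dump and counts statements typed rdf:langString without a language tag, failing the build when the count is non-zero. The Python sketch below is only an illustration under that assumption; `count_untagged_langstrings` is a hypothetical helper, not an existing test in the extraction framework, and it assumes line-based N-Triples content as in the persondata files.

```python
import bz2

# In N-Triples, an explicit rdf:langString datatype implies a missing language tag.
LANGSTRING_SUFFIX = '"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#langString>'


def count_untagged_langstrings(path):
    """Count statements in a .ttl.bz2 file whose object is typed rdf:langString."""
    count = 0
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            statement = line.strip()
            # Drop the terminating dot, then look at the end of the object term.
            if statement.endswith(".") and statement[:-1].rstrip().endswith(LANGSTRING_SUFFIX):
                count += 1
    return count


if __name__ == "__main__":
    import sys

    invalid = count_untagged_langstrings(sys.argv[1])
    print(f"{invalid} literal(s) typed rdf:langString without a language tag")
    sys.exit(1 if invalid else 0)
```

Run against a downloaded dump, the script exits with a non-zero status when any such literal is found, so it could be wired into a release check.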