request JSON-LD from Link rel=alternate
Ref #128
With the following test cases defined:
context.jsonld
{
"@context": {
"@vocab": "https://w3c.github.io/json-ld-api/tests/vocab#",
"dcterms": "http://purl.org/dc/terms/",
"jld": "https://w3c.github.io/json-ld-api/tests/vocab#",
"mf": "http://www.w3.org/2001/sw/DataAccess/tests/test-manifest#",
"rdfs": "http://www.w3.org/2000/01/rdf-schema#",
"xsd": "http://www.w3.org/2001/XMLSchema#",
"context": { "@type": "@id" },
"expect": { "@id": "mf:result", "@type": "@id" },
"expectErrorCode": { "@id": "mf:result" },
"frame": { "@type": "@id" },
"input": { "@id": "mf:action", "@type": "@id" },
"option": { "@type": "@id"},
"sequence": { "@id": "mf:entries", "@type": "@id", "@container": "@list" },
"redirectTo": { "@type": "@id"},
"name": "mf:name",
"purpose": "rdfs:comment",
"description": "rdfs:comment",
"base": { "@type": "@id" },
"compactArrays": { "@type": "xsd:boolean" },
"compactToRelative": { "@type": "xsd:boolean" },
"contentType": { "@type": "xsd:string" },
"expandContext": { "@type": "@id" },
"extractAllScripts": { "@type": "xsd:boolean" },
"httpLink": { "@type": "xsd:string", "@container": "@set" },
"httpStatus": { "@type": "xsd:integer" },
"normative": { "@type": "xsd:boolean" },
"processingMode": { "@type": "xsd:string" },
"processorFeature": { "@type": "xsd:string" },
"produceGeneralizedRdf":{ "@type": "xsd:boolean" },
"specVersion": { "@type": "xsd:string" },
"useNativeTypes": { "@type": "xsd:boolean" }
}
}
manifest.jsonld
{
"@context": ["context.jsonld", {"@base": "manifest"}],
"@id": "",
"@type": "mf:Manifest",
"name": "JSON-LD Test Suite",
"description": "This manifest loads some tests for resolving https://github.com/digitalbazaar/pyld/issues/128",
"sequence": [{
"@id": "#t1",
"@type": ["jld:PositiveEvaluationTest", "jld:ExpandTest"],
"name": "Test for JSON-LD via Link header",
"purpose": "Tests for correct retrieval of remote JSON-LD when it is present as a Link HTTP header",
"input": "/full/path/to/pyld/tests/sample.jsonld",
"expect": "/full/path/to/pyld/tests/output.jsonld"
}, {
"@id": "#t2",
"@type": ["jld:PositiveEvaluationTest", "jld:ExpandTest"],
"name": "Test for JSON-LD via direct JSON-LD URL",
"purpose": "Tests for correct retrieval of remote JSON-LD when it is given as a direct link within a context",
"input": "/full/path/to/pyld/tests/sample2.jsonld",
"expect": "/full/path/to/pyld/tests/output.jsonld"
}]
}
sample.jsonld
{
"@context": "https://schema.org",
"@type":"Dataset",
"@id":"http://localhost:5000/collections/obs",
"url":"http://localhost:5000/collections/obs"
}
sample2.jsonld
{
"@context": "https://schema.org/docs/jsonldcontext.jsonld",
"@type":"Dataset",
"@id":"http://localhost:5000/collections/obs",
"url":"http://localhost:5000/collections/obs"
}
output.jsonld
[
{
"@id": "http://localhost:5000/collections/obs",
"@type": [
"http://schema.org/Dataset"
],
"http://schema.org/url": [
{
"@id": "http://localhost:5000/collections/obs"
}
]
}
]
The tests both fail before the changes. The tests both passs after the changes. Since all the test cases of this repository are remote, I am not sure whether or where to contribute these test cases.
However, there is one regression, which I do not currently know how to resolve. This test does not have an error before the changes, and does have an error after the changes. The report is:
======================================================================
ERROR: Remote document: https://w3c.github.io/json-ld-api/tests/remote-doc-manifest#t0002: Document loader loads a JSON document.
----------------------------------------------------------------------
Traceback (most recent call last):
File "tests/runtests.py", line 372, in runTest
raise e
File "tests/runtests.py", line 319, in runTest
result = getattr(jsonld, fn)(*params)
File "tests/../lib/pyld/jsonld.py", line 163, in expand
return JsonLdProcessor().expand(input_, options)
File "tests/../lib/pyld/jsonld.py", line 820, in expand
remote_doc = load_document(input_, options)
File "tests/../lib/pyld/jsonld.py", line 6591, in load_document
code='loading document failed')
pyld.jsonld.JsonLdError: ('No remote document found at the given URL.',)
Type: jsonld.NullRemoteDocument
Code: loading document failed
All other tests are unaffected. However, as mentioned in #128, running the tests as described with no changes elicits 5 failures, 2 errors, and 34 skipped tests.
This PR does not fixes the following test:
from pyld import jsonld
jsonld.expand(
{
"@context": "http://schema.org/",
"@type": "Person",
"name": "Jane Doe",
"jobTitle": "Professor",
"telephone": "(425) 123-4567",
"url": "http://www.janedoe.com",
}
)
because https://schema.org gives Content-Type: text/html, which is in headers["Accept"]:
(Pdb) p headers["Accept"]
'application/ld+json;profile=http://www.w3.org/ns/json-ld#context, application/ld+json, application/json;q=0.5, text/html;q=0.8, application/xhtml+xml;q=0.8'
it's here due to pyld/jsonld.py adding it:
6573 # FIXME: only if html5lib loaded?
6574 headers['Accept'] = headers['Accept'] + ', text/html;q=0.8, application/xhtml+xml;q=0.8'
According to json-ld:
A processor seeing a non-JSON result will note the presence of the link header and load that document instead.
So even if we accept HTML, if it's HTML and there's a json-ld alternative, let's use it.
To do this, I'll move doc['document'] = response.json() right before returning, and just drop the if content_type not in headers['Accept']: you added, it works for me.
In other words (and with a correct indentation) I mean:
--- a/lib/pyld/documentloader/requests.py
+++ b/lib/pyld/documentloader/requests.py
@@ -69,7 +69,6 @@ def requests_document_loader(secure=False, **kwargs):
'contentType': content_type,
'contextUrl': None,
'documentUrl': response.url,
- 'document': response.json() if content_type in headers['Accept'] else None
}
link_header = response.headers.get('link')
if link_header:
@@ -77,15 +76,15 @@ def requests_document_loader(secure=False, **kwargs):
LINK_HEADER_REL)
# only 1 related link header permitted
if linked_context and content_type != 'application/ld+json':
- if isinstance(linked_context, list):
- raise JsonLdError(
- 'URL could not be dereferenced, '
- 'it has more than one '
- 'associated HTTP Link Header.',
- 'jsonld.LoadDocumentError',
- {'url': url},
- code='multiple context link headers')
- doc['contextUrl'] = linked_context['target']
+ if isinstance(linked_context, list):
+ raise JsonLdError(
+ 'URL could not be dereferenced, '
+ 'it has more than one '
+ 'associated HTTP Link Header.',
+ 'jsonld.LoadDocumentError',
+ {'url': url},
+ code='multiple context link headers')
+ doc['contextUrl'] = linked_context['target']
linked_alternate = parse_link_header(link_header).get('alternate')
# if not JSON-LD, alternate may point there
if (linked_alternate and
@@ -93,9 +92,8 @@ def requests_document_loader(secure=False, **kwargs):
not re.match(r'^application\/(\w*\+)?json$', content_type)):
doc['contentType'] = 'application/ld+json'
doc['documentUrl'] = prepend_base(url, linked_alternate['target'])
- if content_type not in headers['Accept']:
- # Original was not JSON/JSON-LD; fetch linked JSON-LD
- return loader(doc['documentUrl'], options=options)
+ return loader(doc['documentUrl'], options=options)
+ doc['document'] = response.json()
return doc
except JsonLdError as e:
raise e
Hmm, that's a bit odd. Yes, Pyld adds text/html (in the line you pointed out, i.e. headers['Accept'] = headers['Accept'] + ', text/html;q=0.8, application/xhtml+xml;q=0.8') but that's a default header value and it shouldn't affect the server response as long as application/json+ld is included in the Accept header value with a higher precedence. That check
if content_type not in headers['Accept']:
# Original was not JSON/JSON-LD; fetch linked JSON-LD
return loader(doc['documentUrl'], options=options)
is important, it checks whether the server responded with application/json+ld, and if not, attempts to fetch the linked resource. There's no value doing this if the response is already JSON-LD.
it shouldn't affect the server response as long as
application/json+ldis included in theAcceptheader value with a higher precedence.
Totally agree. But looks like https://schema.org/ does not have a application/json-ld variant at all, so it always reply with text/html, independently of the Accept header. But this text/html response links to the ld+json, see:
$ curl -I -H "Accept: application/ld+json" https://schema.org
HTTP/2 200
[...]
link: </docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"
[...]
content-type: text/html
[...]
Which looks legit to me, even though I still didn't read and understood rfc8288 entierly.
That check
if content_type not in headers['Accept']: # Original was not JSON/JSON-LD; fetch linked JSON-LD return loader(doc['documentUrl'], options=options)is important, it checks whether the server responded with
application/json+ld, and if not, attempts to fetch the linked resource. There's no value doing this if the response is already JSON-LD.
IIRC It looks already covered by the:
not re.match(r'^application/(\w*+)?json$', content_type)):
check a few line before.
Ah I see. Sorry it's been a while since I've looked at this. I think your version of the PR makes sense, and as long as both our tests pass, it gets my vote.
and as long as both our tests pass, it gets my vote.
I'd like to add both tests to the test suite, but I don't understand how to do so. If someone can explain before merging I'd gladly do so.