pyld icon indicating copy to clipboard operation
pyld copied to clipboard

request JSON-LD from Link rel=alternate

Open alpha-beta-soup opened this issue 5 years ago • 5 comments

Ref #128

With the following test cases defined:

context.jsonld

{
  "@context": {
    "@vocab":   "https://w3c.github.io/json-ld-api/tests/vocab#",
    "dcterms":       "http://purl.org/dc/terms/",
    "jld":      "https://w3c.github.io/json-ld-api/tests/vocab#",
    "mf":       "http://www.w3.org/2001/sw/DataAccess/tests/test-manifest#",
    "rdfs":     "http://www.w3.org/2000/01/rdf-schema#",
    "xsd":      "http://www.w3.org/2001/XMLSchema#",

    "context":         { "@type": "@id" },
    "expect":          { "@id": "mf:result", "@type": "@id" },
    "expectErrorCode": { "@id": "mf:result" },
    "frame":           { "@type": "@id" },
    "input":           { "@id": "mf:action", "@type": "@id" },
    "option":          { "@type": "@id"},
    "sequence":        { "@id": "mf:entries", "@type": "@id", "@container": "@list" },
    "redirectTo":      { "@type": "@id"},

    "name":                 "mf:name",
    "purpose":              "rdfs:comment",
    "description":          "rdfs:comment",
    "base":                 { "@type": "@id" },
    "compactArrays":        { "@type": "xsd:boolean" },
    "compactToRelative":    { "@type": "xsd:boolean" },
    "contentType":          { "@type": "xsd:string" },
    "expandContext":        { "@type": "@id" },
    "extractAllScripts":    { "@type": "xsd:boolean" },
    "httpLink":             { "@type": "xsd:string", "@container": "@set" },
    "httpStatus":           { "@type": "xsd:integer" },
    "normative":            { "@type": "xsd:boolean" },
    "processingMode":       { "@type": "xsd:string" },
    "processorFeature":     { "@type": "xsd:string" },
    "produceGeneralizedRdf":{ "@type": "xsd:boolean" },
    "specVersion":          { "@type": "xsd:string" },
    "useNativeTypes":       { "@type": "xsd:boolean" }
  }
}

manifest.jsonld

{
  "@context": ["context.jsonld", {"@base": "manifest"}],
  "@id": "",
  "@type": "mf:Manifest",
  "name": "JSON-LD Test Suite",
  "description": "This manifest loads some tests for resolving https://github.com/digitalbazaar/pyld/issues/128",
  "sequence": [{
  	"@id": "#t1",
  	"@type": ["jld:PositiveEvaluationTest", "jld:ExpandTest"],
  	"name": "Test for JSON-LD via Link header",
  	"purpose": "Tests for correct retrieval of remote JSON-LD when it is present as a Link HTTP header",
  	"input": "/full/path/to/pyld/tests/sample.jsonld",
  	"expect": "/full/path/to/pyld/tests/output.jsonld"
  }, {
    "@id": "#t2",
    "@type": ["jld:PositiveEvaluationTest", "jld:ExpandTest"],
    "name": "Test for JSON-LD via direct JSON-LD URL",
    "purpose": "Tests for correct retrieval of remote JSON-LD when it is given as a direct link within a context",
    "input": "/full/path/to/pyld/tests/sample2.jsonld",
    "expect": "/full/path/to/pyld/tests/output.jsonld"
  }]
}

sample.jsonld

{
	"@context": "https://schema.org",
	"@type":"Dataset",
	"@id":"http://localhost:5000/collections/obs",
	"url":"http://localhost:5000/collections/obs"
}

sample2.jsonld

{
	"@context": "https://schema.org/docs/jsonldcontext.jsonld",
	"@type":"Dataset",
	"@id":"http://localhost:5000/collections/obs",
	"url":"http://localhost:5000/collections/obs"
}

output.jsonld

[
  {
    "@id": "http://localhost:5000/collections/obs",
    "@type": [
      "http://schema.org/Dataset"
    ],
    "http://schema.org/url": [
      {
        "@id": "http://localhost:5000/collections/obs"
      }
    ]
  }
]

The tests both fail before the changes. The tests both passs after the changes. Since all the test cases of this repository are remote, I am not sure whether or where to contribute these test cases.

However, there is one regression, which I do not currently know how to resolve. This test does not have an error before the changes, and does have an error after the changes. The report is:

======================================================================
ERROR: Remote document: https://w3c.github.io/json-ld-api/tests/remote-doc-manifest#t0002: Document loader loads a JSON document.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/runtests.py", line 372, in runTest
    raise e
  File "tests/runtests.py", line 319, in runTest
    result = getattr(jsonld, fn)(*params)
  File "tests/../lib/pyld/jsonld.py", line 163, in expand
    return JsonLdProcessor().expand(input_, options)
  File "tests/../lib/pyld/jsonld.py", line 820, in expand
    remote_doc = load_document(input_, options)
  File "tests/../lib/pyld/jsonld.py", line 6591, in load_document
    code='loading document failed')
pyld.jsonld.JsonLdError: ('No remote document found at the given URL.',)
Type: jsonld.NullRemoteDocument
Code: loading document failed

All other tests are unaffected. However, as mentioned in #128, running the tests as described with no changes elicits 5 failures, 2 errors, and 34 skipped tests.

alpha-beta-soup avatar Jun 24 '20 02:06 alpha-beta-soup

This PR does not fixes the following test:

from pyld import jsonld

jsonld.expand(
    {
        "@context": "http://schema.org/",
        "@type": "Person",
        "name": "Jane Doe",
        "jobTitle": "Professor",
        "telephone": "(425) 123-4567",
        "url": "http://www.janedoe.com",
    }
)

because https://schema.org gives Content-Type: text/html, which is in headers["Accept"]:

(Pdb) p headers["Accept"]
'application/ld+json;profile=http://www.w3.org/ns/json-ld#context, application/ld+json, application/json;q=0.5, text/html;q=0.8, application/xhtml+xml;q=0.8'

it's here due to pyld/jsonld.py adding it:

6573 	    # FIXME: only if html5lib loaded?
6574 	    headers['Accept'] = headers['Accept'] + ', text/html;q=0.8, application/xhtml+xml;q=0.8'

According to json-ld:

A processor seeing a non-JSON result will note the presence of the link header and load that document instead.

So even if we accept HTML, if it's HTML and there's a json-ld alternative, let's use it.

To do this, I'll move doc['document'] = response.json() right before returning, and just drop the if content_type not in headers['Accept']: you added, it works for me.

In other words (and with a correct indentation) I mean:

--- a/lib/pyld/documentloader/requests.py
+++ b/lib/pyld/documentloader/requests.py
@@ -69,7 +69,6 @@ def requests_document_loader(secure=False, **kwargs):
                 'contentType': content_type,
                 'contextUrl': None,
                 'documentUrl': response.url,
-                'document': response.json() if content_type in headers['Accept'] else None
             }
             link_header = response.headers.get('link')
             if link_header:
@@ -77,15 +76,15 @@ def requests_document_loader(secure=False, **kwargs):
                     LINK_HEADER_REL)
                 # only 1 related link header permitted
                 if linked_context and content_type != 'application/ld+json':
-                  if isinstance(linked_context, list):
-                      raise JsonLdError(
-                          'URL could not be dereferenced, '
-                          'it has more than one '
-                          'associated HTTP Link Header.',
-                          'jsonld.LoadDocumentError',
-                          {'url': url},
-                          code='multiple context link headers')
-                  doc['contextUrl'] = linked_context['target']
+                    if isinstance(linked_context, list):
+                        raise JsonLdError(
+                            'URL could not be dereferenced, '
+                            'it has more than one '
+                            'associated HTTP Link Header.',
+                            'jsonld.LoadDocumentError',
+                            {'url': url},
+                            code='multiple context link headers')
+                    doc['contextUrl'] = linked_context['target']
                 linked_alternate = parse_link_header(link_header).get('alternate')
                 # if not JSON-LD, alternate may point there
                 if (linked_alternate and
@@ -93,9 +92,8 @@ def requests_document_loader(secure=False, **kwargs):
                         not re.match(r'^application\/(\w*\+)?json$', content_type)):
                     doc['contentType'] = 'application/ld+json'
                     doc['documentUrl'] = prepend_base(url, linked_alternate['target'])
-                    if content_type not in headers['Accept']:
-                        # Original was not JSON/JSON-LD; fetch linked JSON-LD
-                        return loader(doc['documentUrl'], options=options)
+                    return loader(doc['documentUrl'], options=options)
+            doc['document'] = response.json()
             return doc
         except JsonLdError as e:
             raise e

JulienPalard avatar Oct 09 '20 09:10 JulienPalard

Hmm, that's a bit odd. Yes, Pyld adds text/html (in the line you pointed out, i.e. headers['Accept'] = headers['Accept'] + ', text/html;q=0.8, application/xhtml+xml;q=0.8') but that's a default header value and it shouldn't affect the server response as long as application/json+ld is included in the Accept header value with a higher precedence. That check

if content_type not in headers['Accept']:
    # Original was not JSON/JSON-LD; fetch linked JSON-LD
    return loader(doc['documentUrl'], options=options)

is important, it checks whether the server responded with application/json+ld, and if not, attempts to fetch the linked resource. There's no value doing this if the response is already JSON-LD.

alpha-beta-soup avatar Oct 11 '20 02:10 alpha-beta-soup

it shouldn't affect the server response as long as application/json+ld is included in the Accept header value with a higher precedence.

Totally agree. But looks like https://schema.org/ does not have a application/json-ld variant at all, so it always reply with text/html, independently of the Accept header. But this text/html response links to the ld+json, see:

$ curl -I -H "Accept: application/ld+json" https://schema.org
HTTP/2 200 
[...]
link: </docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"
[...]
content-type: text/html
[...]

Which looks legit to me, even though I still didn't read and understood rfc8288 entierly.

That check

if content_type not in headers['Accept']:
    # Original was not JSON/JSON-LD; fetch linked JSON-LD
    return loader(doc['documentUrl'], options=options)

is important, it checks whether the server responded with application/json+ld, and if not, attempts to fetch the linked resource. There's no value doing this if the response is already JSON-LD.

IIRC It looks already covered by the:

not re.match(r'^application/(\w*+)?json$', content_type)):

check a few line before.

JulienPalard avatar Oct 11 '20 07:10 JulienPalard

Ah I see. Sorry it's been a while since I've looked at this. I think your version of the PR makes sense, and as long as both our tests pass, it gets my vote.

alpha-beta-soup avatar Oct 11 '20 08:10 alpha-beta-soup

and as long as both our tests pass, it gets my vote.

I'd like to add both tests to the test suite, but I don't understand how to do so. If someone can explain before merging I'd gladly do so.

JulienPalard avatar Oct 11 '20 10:10 JulienPalard