jsonschema icon indicating copy to clipboard operation
jsonschema copied to clipboard

Keep object key order when resolving a schema

Open marmeladema opened this issue 7 years ago • 13 comments

As explained in the title, when resolving a schema on Python version (< 3.6) where dict implementation does not keep key insertion ordering, the object key order can be lost. This can be especially problematic for the "properties" object that is usually found in schemas.

I would be glad to provide a PR to fix this. My approach would be to provide a json_kwargs as RefResolver.init argument and forward it to json.loads in resolve_remote, so that users that need this functionality (at least me) could pass the object_pairs_hook argument with an OrderedDict value.

Would you be ok with that ?

marmeladema avatar Oct 24 '18 09:10 marmeladema

Hey -- RefResolver is the only place that really deals with deserialization yeah -- this is probably related to python-jsonschema/referencing#4, since I imagine a solution for the first would solve this too, though obviously that ticket is more general. I haven't yet fully thought through what a decent interface looks like there, but maybe have a look, since I suspect there's one decent solution that covers both use cases.

Some random brainstorming:

What I'd initially imagine is an argument for a deserializer, and that deserializer being injectable with your version of JSON loading.

But, even better I suspect would be to have <HTTP layer> wrapped by <HTTP to Python Object Layer>, and then only the latter gets passed to RefResolver.

There are after all at least 3 distinct tasks here:

  • Take URL, find handler that knows how to retrieve it
  • Take handler (e.g. HTTP), make request, return response (this is unfortunately right now either requests or the urllib2 module)
  • Take (e.g. HTTP) response, decide how to deserialize it based on configuration + context

And yeah how many objects that is, and whether any or all of them is RefResolver is what I still haven't fully thought through, so comments definitely welcome :)

Julian avatar Oct 24 '18 12:10 Julian

The other issue is way more general than the problem i am facing right now. Also i am using the resolver locally with file:// scheme, no HTTP server involved. But as a first step, we could add a deserializer argument to RefResolver and call it in resolve_remote. Default value would still be json.loads. This way, you could solve the yaml problem and this issue at the same time. The .json() method from the requests module will not be used anymore though, as all deserialization will use the same code path (which i think if safer anyway).

marmeladema avatar Oct 24 '18 13:10 marmeladema

It is, yes, but I don't want 2 (which undoubtedly eventually becomes 3 and 4 and ..) when a decently designed interface solves both issues and seems likely to solve future ones.

If such a solution exists that'd be worth doing, which is why I say it needs thinking through before just adding arguments willy nilly.

On Wed, Oct 24, 2018, 09:29 marmeladema [email protected] wrote:

The other issue is way more general than the problem i am facing right now. Also i am using the resolver locally with file:// scheme, no HTTP server involved. But as a first step, we could add a deserializer argument to RefResolver and call it in resolve_remote. Default value would still be json.loads. This way, you could solve the yaml problem and this issue at the same time. The .json() method from the requests module will not be used anymore though, as all deserialization will use the same code path (which i think if safer anyway).

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Julian/jsonschema/issues/481#issuecomment-432656209, or mute the thread https://github.com/notifications/unsubscribe-auth/AAUIXi7J7cVqab3lW5zpqIilF4jfxYg2ks5uoGs2gaJpZM4X3hYQ .

Julian avatar Oct 24 '18 13:10 Julian

And by the way, requests' json method cannot just be replaced with json.loads unfortunately, which is almost the whole reason we use requests.

It implements what the json spec says is a correct unicode decode dance, which json.loads does not.

On Wed, Oct 24, 2018, 09:32 Julian Berman [email protected] wrote:

It is, yes, but I don't want 2 (which undoubtedly eventually becomes 3 and 4 and ..) when a decently designed interface solves both issues and seems likely to solve future ones.

If such a solution exists that'd be worth doing, which is why I say it needs thinking through before just adding arguments willy nilly.

On Wed, Oct 24, 2018, 09:29 marmeladema [email protected] wrote:

The other issue is way more general than the problem i am facing right now. Also i am using the resolver locally with file:// scheme, no HTTP server involved. But as a first step, we could add a deserializer argument to RefResolver and call it in resolve_remote. Default value would still be json.loads. This way, you could solve the yaml problem and this issue at the same time. The .json() method from the requests module will not be used anymore though, as all deserialization will use the same code path (which i think if safer anyway).

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Julian/jsonschema/issues/481#issuecomment-432656209, or mute the thread https://github.com/notifications/unsubscribe-auth/AAUIXi7J7cVqab3lW5zpqIilF4jfxYg2ks5uoGs2gaJpZM4X3hYQ .

Julian avatar Oct 24 '18 13:10 Julian

It is, yes, but I don't want 2 (which undoubtedly eventually becomes 3 and 4 and ..) when a decently designed interface solves both issues and seems likely to solve future ones.

Do you think a PR as a "work-in-progress" base of discussion could help here ? Not meant to be merged of course. Just so that people can comment, and we can go toward finding a solution.

And by the way, requests' json method cannot just be replaced with json.loads unfortunately, which is almost the whole reason we use requests. It implements what the json spec says is a correct unicode decode dance, which json.loads does not.

We can still use the guess_json_utf function from the requests module to handle this dance properly.

marmeladema avatar Oct 24 '18 13:10 marmeladema

Yep definitely makes sense on PR to propose something!

On Wed, Oct 24, 2018, 09:44 marmeladema [email protected] wrote:

It is, yes, but I don't want 2 (which undoubtedly eventually becomes 3 and 4 and ..) when a decently designed interface solves both issues and seems likely to solve future ones.

Do you think a PR as a "work-in-progress" base of discussion could help here ? Not meant to be merged of course. Just so that people can comment, and we can go toward finding a solution.

And by the way, requests' json method cannot just be replaced with json.loads unfortunately, which is almost the whole reason we use requests. It implements what the json spec says is a correct unicode decode dance, which json.loads does not.

We can still use the guess_json_utf function from the requests module to handle this dance properly.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Julian/jsonschema/issues/481#issuecomment-432661778, or mute the thread https://github.com/notifications/unsubscribe-auth/AAUIXmTkjdWyvZEZpI1AfjMgITVsK4j9ks5uoG6ngaJpZM4X3hYQ .

Julian avatar Oct 25 '18 02:10 Julian

I had a similar use-case; I wanted to add a timeout to the request. Based on RefResolver.resolve_remote, I replaced all calls to Draft7Validator with a factory method that created a validator + RefResolver with custom handlers (they take precedence):

def make_validator(schema, ..., timeout=TIMEOUT_IN_SECONDS):
    # ..., figure out base_uri etc

    def get_with_timeout(uri):
        return requests.get(uri, timeout=timeout).json()

    resolver = RefResolver(
        base_uri=base_uri,
        referrer=schema,
        handlers={"http": get_with_timeout, "https": get_with_timeout},
    )
    return Draft7Validator(schema, resolver=resolver)

This seems to work fine, although if the API settles down, I would just inherit from RefResolver and override resolve_remote to avoid weird fall-through cases.

Your problem is similar. Luckily, on recent requests version (don't know how recent), you can specify extra keyword arguments for .json() and still get all the sanity-preserving encoding handling:

from __future__ import absolute_import, print_function, unicode_literals
import requests
from collections import OrderedDict

url = "https://api.github.com/users/octocat/repos"
requests.get(url).json(object_pairs_hook=OrderedDict)

So you could use also use a function and custom handlers, e.g.:

def get_with_order(uri):
    return requests.get(uri).json(object_pairs_hook=OrderedDict)

Sure, it isn't super clean or optimal, and personally, I'd dump legacy Python and move to Python 3, but maybe this unblocks you?

(Also, if there's a reason not use do this, I'd like to know)

For the broader discussion, being able to mess around with RefResolver would be super useful because it does a lot, but I can understand the reluctance of having to support old designs.

tobywf avatar Nov 13 '18 00:11 tobywf

(Also, if there's a reason not use do this, I'd like to know) This seems to work fine, although if the API settles down, I would just inherit from RefResolver and override resolve_remote to avoid weird fall-through cases.

Subclassing isn't a supported public API for any classes in jsonschema :), but otherwise this looks reasonable.

For the broader discussion, being able to mess around with RefResolver would be super useful because it does a lot, but I can understand the reluctance of having to support old designs.

Definitely keen to hear if you have any use cases that wouldn't be solved by what's in python-jsonschema/referencing#4!

Julian avatar Nov 13 '18 18:11 Julian

It would not work for me as i don't not want to provide a handler for a specific protocol, i need ordering whatever the protocol. I can not even fake the handlers argument because it is transformed as dict in the constructor.

marmeladema avatar Nov 13 '18 18:11 marmeladema

Fair enough, was hoping one of those approaches helps, but guess it's a niche use-case. My understanding is JSON specifies objects are unordered.

Anyway, I'll add my thoughts to python-jsonschema/referencing#4 , didn't mean to hijack the issue!

tobywf avatar Nov 13 '18 19:11 tobywf

@Julian I recently observed a very annoying bug where validate() randomly returns one of two errors for the same schema file. Also all currently supported versions of python do keep the dictionary key order, so no longer have to worry about old behaviors. Sadly it not really possible to build a small reproducible use case as that order seems to be relatively stable for subsequent runs on the same machine, but after reboot some other events it can flip.

-roles/ensure-kubernetes/tasks/minikube.yaml:1: schema: True is not of type 'string' (schema[tasks])
[186](https://github.com/ansible/ansible-lint/runs/6526804029?check_suite_focus=true#step:7:187)
+roles/ensure-kubernetes/tasks/minikube.yaml:1: schema: False is not of type 'string' (schema[tasks])

Taken from https://github.com/ansible/ansible-lint/runs/6526804029?check_suite_focus=true

As my schemas are already sorted, i was hoping that we might also get the same output. Also our schemas do not have external references so I am not sure what is really causing this to behave like this. Should I file a new bug for this or reuse this bug?

I was able to narrow down it a little bit as this how the intance dictionary looks like:

  environment:
    MINIKUBE_WANTUPDATENOTIFICATION: false
    MINIKUBE_WANTREPORTERRORPROMPT: false
    MINIKUBE_WANTNONEDRIVERWARNING: false
    MINIKUBE_WANTKUBECTLDOWNLOADMSG: false
    CHANGE_MINIKUBE_NONE_USER: true
    MINIKUBE_HOME: "{{ ansible_user_dir }}"
    KUBECONFIG: "{{ ansible_user_dir }}/.kube/config"

My schema is validating that all key:values from this dictionary must be strings. As you can see multiple entries are not string, so basically the entire mapping has multiple errors.

I suspect that in that case the found errors are not sorted in any way and I endup reporting one of them, a random one. Probably need need to sort these to get a more predictable behavior. I am not keen into going towards reporting all errors because that can be overwhelming for the user, especially as one error can have undesired side effects, causing other false ones.

ssbarnea avatar May 20 '22 15:05 ssbarnea

A new bug with a minimal reproducer would definitely help. I don't know that we can promise any order to Validator.iter_errors, but if there's a simple way the library messes up the order of an existing ordered dict that should be preserved indeed.

Julian avatar May 20 '22 15:05 Julian

@Julian Lets switch to python-jsonschema/jsonschema#593 to avoid spamming this issue. I was able to reproduce it, now we need to find a way to improve the situation and make the outcomes more predictable.

ssbarnea avatar May 20 '22 16:05 ssbarnea

We (the library) no longer support <3.6 so I think this is closeable. If that's not the case and there's still what to address here feel free to follow up.

Julian avatar Nov 01 '22 18:11 Julian