jsonschema Don't share scope stacks between calls to RefResolver

Not sure if this is really a goal or not. The top level call to validate(...) seems fine since it constructs new Validator instances for every call.

The main issue I see here is that a stack (_scopes_stack) in the RefResolver instance is shared and modified by all calls to iter_errors() rather than being call specific.

May 08 '15 16:05 ltucker

Hey!

It's not explicitly a goal to be threadsafe, generally I'd recommend just creating a different validator for each thread, but having _scope_stack not be scoped to each call might be a thing worth fixing for cleanliness reasons if we can come up with a decent way to do so.

May 10 '15 14:05 Julian

In a multithreaded situation, you could make a different RefResolver per thread, but then each thread would be separately caching the same data, which seems pretty wasteful.

Moving RefResolver's _scope_stack into a thread local would not be a difficult modification. I'd be happy to submit a pull request that does just that. However, there's a complication...

Right now I'm running into this issue while using Bravado. I have a multithreaded app that uses yelp/Bravado (and related project yelp/swagger_schema_validator) to make requests to other services, and the stack in the RefResolver gets thoroughly messed up.

The complication is, they're modifying the _scope_stack directly. Take a look at this:

https://github.com/Yelp/swagger_spec_validator/blob/v2.1.0/swagger_spec_validator/ref_validators.py#L178

They are using a context manager that completely switches out the scope stack - relying on an internal variable to do so. This leads to the bad situations like:

thread 1 pushes to the stack. Stack length: 2
thread 2 switches the stack with one where nothing has been pushed. Stack length: 1
thread 1 pops the stack. Stack length: 0
either thread calls resolution_scope(), resulting in IndexError

The simplest way to fix the issue at the jsonschema level is to make a change like this to RefResolver:

    @property
    def _scopes_stack(self):
        # A resolver should be usable from multiple threads.
        # Thus, each thread should see its own stack. Use threading.local()
        # to make this happen.
        try:
            return self.thread_local.scopes_stack
        except AttributeError:
            self.thread_local.scopes_stack = list(self._initial_scopes_stack)
            return self.thread_local.scopes_stack

(plus appropriate changes in the constructor)

However, since Bravado is making assignments to the _scope_stack variable, that will break their use case. We could add additional code here to support that use case:

    @_scopes_stack.setter
    def _scopes_stack(self, value):
        self.thread_local.scopes_stack = value

Upside is, their code continues to work without an issue. Downside is, jsonschema code just got unnecessarily bloated to support a use case that's not part of the external API.

So, options are:

1. make the first change; their use cases will break and they'll be forced to fix them.
1. make both changes.
1. make the first change and add an explicit with_scope_stack() method to your external API; they'd still need to change their code, but it would be a change from using an internal variable to a public API, so everyone wins.

Of course in case 1 or 3 it'd be a good idea to let those folks know in advance. No reason to be mean about it. For 3) it'd be pretty easy to have a pull request ready to go for them, too, as part of the change.

I'd be happy to submit a pull request here for any of these options, but I'd like to hear which solution the jsonschema maintainers prefer before doing that. There's some philosophy of development stuff in there, and I can see arguments for all sides.

This could also be fixed in bravado_core or swagger_schema_validator, but as far as I can see most of the clean ways to fix it there will result in repeated caching.

Jul 12 '17 18:07 ostacey

Hey! Sorry it took me so long to get to this, finally back home.

That module is... pretty disappointing :/. I'd like to spend a bit of time understanding why they did some of the things in there -- there are at least a few that have what I thought were sane public APIs to do what is being done in unsafe ways there, so maybe there's a doc gap (very likely), but there's certainly also some functionality there that might be useful to have upstream. Do you work on bravado or are you a user?

In general here by the way, I think RefResolver needs a redesign entirely (which I chucked some notes up on python-jsonschema/referencing#3 but they're just chicken scratch. I've played around a bit but haven't gotten anything concrete pushed out yet, I'm kind of going back and forth between that and Draft 6 support in the limited spare time I have.

As for your initial premise by the way:

In a multithreaded situation, you could make a different RefResolver per thread, but then each thread would be separately caching the same data, which seems pretty wasteful.

I'm not sure how wasteful that seems to me -- how many refs are you resolving and how large are they that duplicating some dicts a few times is meaningful?

I'm not saying I'm against the change, if it comes with tests that suitably cover it and is reasonably compact I'm OK with .. probably "i" if we can manage to ping a contributor to bravado and chat about why they need to do that, and if not, then "ii".

Thanks certainly for bringing this back up again!

Aug 01 '17 12:08 Julian

Hello there!

This, along with many many other $ref-related issues, is now finally being handled in #1049 with the introduction of a new referencing library which is fully compliant and has APIs which I hope are a lot easier to understand and customize.

The next release of jsonschema (v4.18.0) will contain a merged version of that PR, and should be released shortly in beta, and followed quickly by a regular release, assuming no critical issues are reported.

Specifically, the new APIs don't have any shared state like this, so even though it's still the case that no thread safety has even been thought about whilst writing the libraries, let alone promised, this should definitely work better than before.

If you still care to, I'd love it if you tried out the beta once it is released, or certainly it'd be hugely helpful to immediately install the branch containing this work (https://github.com/python-jsonschema/jsonschema/tree/referencing) and confirm. You can in the interim find documentation for the change in a preview page here.

I'm going to close this given it indeed seems like it is addressed by #1049, but feel free to follow up with any comments. Sorry for the delay in getting to these, but hopefully this new release will bring lots of benefit!

Feb 23 '23 09:02 Julian