gdext icon indicating copy to clipboard operation
gdext copied to clipboard

Deal with cross-thread access to user-defined GodotClass instances

Open chitoyuu opened this issue 3 years ago • 20 comments

Currently, RefCell and i32 are used to store user-defined GodotClass instances and their reference counts:

https://github.com/godot-rust/gdextension/blob/a64f22409e05d1353492fded4aa36a4b4e6b710d/godot-core/src/storage.rs#L15-L21

In a multi-threaded configuration of the engine, this may lead to unsoundness when the instances are accessed by GDScript or engine code. Noting that GodotClass does not have a Send or Sync requirement, this can be as simple as a non-Send field in the type.

@Bromeon in Discord:

I plan to have a mix of multiple things:

  • the "you need to take some responsibility of the non-Rust code"
  • no access to Godot thread APIs as direct Rust classes (Thread, Mutex, Semaphore) by default, and > unsafe access when there is a threads feature enabled
  • GDScript->Rust calls could validate the thread ID and compare it with the one through which the extension library was initialized (which I hope is the main thread) A Cargo feature thread would loosen those restrictions and give access to multithreading, in an unsafe way. We could then either conditionally implement Send/Sync on Gd<T>, or maybe better, provide an explicit Send/Sync type to transport references across thread boundaries. Users would still need to adhere to the Godot thread guidelines.

As prior art, the GDNative bindings deals with the problem with the user-data wrapper system.

chitoyuu avatar Nov 08 '22 13:11 chitoyuu

Re-labeling this as feature for general discussion about multithreading support.

Bromeon avatar Jan 22 '23 16:01 Bromeon

To elaborate on my original plans a bit and keep the discussion public:

As mentioned in https://github.com/godot-rust/gdext/pull/212#issuecomment-1493452625, some operations are currently unsound, for example:

  • when storing a Gd<T> equivalent reference on GDScript side
  • through the scene tree
  • Gd::upcast()
  • Gd::from_instance_id()
  • Gd::from_variant()
  • ...

So individual unsound access points like from_instance_id are mostly symptoms, the core problem is that Gd is not generally safe to construct on other threads.

The semantics here are actually more limiting than Rust's own:

in general, Gd<T> can only be safely constructed and used on the main thread.

This is very important, as Gd<T> is not an isolated instance (as it were in Rust). Through the engine, there can be lots of hidden links to other objects (scene tree, connected signals, base classes...). In other words, Gd<T> potentially has side effects transcending thread boundaries. This concept is not expressible using Rust standard traits; Rust generally assumes that if you have ownership of something, it's unique; and if it's shared across threads, it's Sync.


So, what can we do? My idea is still more or less the same from the original post.

  1. When threads feature is off: we strictly forbid construction of Gd<T> in other threads.

    • Validated through thread ID and panics.
  2. When threads feature is on: we need to differentiate which types are "thread-safe" according to Godot.

    • This not only concerns classes, but also builtins, especially the composite ones (Variant, Array, PackedArray, Dictionary, Callable).
    • Servers and Godot singletons should be constructible from any thread, and shareable across thread boundaries.
    • Resources can be loaded on one thread and sent to another (Send) and read concurrently from different threads (Sync), but not modified concurrently.
  3. Once we have a model, we can turn it into the type system:

    • Traits for properties that can be statically verified (like "this type is safe to construct on non-main threads").
    • Runtime checks and panics for properties that cannot be verified at compile time.
    • Escape hatches via unsafe for cases that need to be supported but don't fit into the type system. Ideally, this surface should be small.

Bromeon avatar Aug 24 '23 08:08 Bromeon

i still kinda feel like it'd be better to just have two separate Gds. One that is only meant to be used on the main thread and one that is able to be used concurrently. maybe more idk we can make it as granular as we like there.

mainly because i really think that making Gd thread safe will be a very invasive procedure and likely hurt the ergonomics a fair bit. several methods may need to be made unsafe, etc. and having it be a feature that changes the behavior of Gd would mean it's either all or nothing.

im addition, how do we document that? a separate type is easy to document, but documenting all the changes from this feature seems like it'd be very confusing and weird. i imagine there is a way to do it in the docs but i cant imagine it'd look nice.

in addition, for the other core classes we can probably just make them Send/Sync if they are. all the value-types are already, since they're just implemented in rust. and for Dictionary and Array we can probably make a thread safe version of each.

anyway let's say we split Gd into Gd and GdSync (temp names). we could just make Gd check that it's always on the main thread when created, and have it not implement Send or Sync ever. (or maybe we can allow it in very specific cases, like perhaps the singletons? that might also just be confusing though).

We could also potentially use object metadata to store what thread affinity an object should have, if we want to allow Gd to be used on more than just the main thread. But that would also mean we'd need to ensure the user doesn't override the metadata somehow. unless, can we set instance bindings on arbitrary objects? if so then we might be able to use that instead.

GdSync would then have a more restricted api. likely more unsafe methods, and probably some way to track whether it should be Send or Sync. i dont know if we can do that by just implementing Send/Sync directly on the classes. at least for Sync that would allow:

let foo = Gd::<Node>::from_instance_id(id);
let node: &Node = *foo;
thread::spawn(|| node.get_child())

maybe it is fine to allow this for classes that actually are Sync (Node probably isn't).

Also there's a distinction i just thought of:

  • some classes can always be safely sent/shared between threads
  • some classes may be safely sent/shared between threads as long as the object is actually that class and not a subclass

generally the first one of these will be true if the entire subtree of the class hierarchy is also send/sync. this would technically include user-classes but we can perform a check in user classes to make sure they're not created/shared where they shouldn't be.

the second one may be true in some cases, for instance it's true for Object i believe.

we may also have some cases where some individual methods are safe to call from other threads. i suspect this may be a bit too hard to model properly though, may be feasible through an unsafe escape hatch. If we do add a RawGd we can use that as the escape hatch, RawGd would be Send and Sync, but most methods on it are unsafe.

There are also a couple other possibilities for dedicated smart pointers we could make if we wanted to. One I've thought of would be analogous to the Unique typestate in gdnative. Essentially a GdBox, which would be probably the most permissive of all the types. However it can only be safely constructed fresh, and unsafely converted from another Gd type. But also can be safely consumed and turned into other gd types. It's also freed on drop. We might actually be able to safely convert RefCounted objects into this type, since we can check the refcount to see that we're the only one owning it. This would likely be very useful in dealing with Resource in that case.

lilizoey avatar Aug 24 '23 12:08 lilizoey

Thanks a lot for the great feedback!

Yep, it might be better to have its dedicated type GdSync (or Gds, SyncGd, Agd or whatever -- I'll use Gds for brevity here) if the APIs differ too much.

Proliferation of public types with closely related meaning, such as Gd, GdSync, RawGd, GdBox, ... goes a bit against our declared Simplicity principle. As a user wanting to develop a game, I'm usually not interested in selecting out of 4+ smart pointer types and memorize all the subtle differences and use cases, if GDScript or C# just allows me to get stuff done. But probably we will see if there's a real need for those once things get more concrete :slightly_smiling_face:

Additionally, there are a lot of things we cannot check at compile time either way, by the nature how Godot operates.

You mentioned a good one: if Gds<Base> can point to Gds<Derived>, we need to ensure that all derivates follow the same constraints. Which creates non-locality: adding a new class deriving from Base may break existing usages of Gd<Base> (at runtime). Or it must be guaranteed that polymorphism (up/downcasting) is only possible when derived classes implement the same traits.

Alternatively, we could outlaw polymorphism on Gds and enforce that Gds<T> always hold exactly T. This might be an acceptable trade-off, since synced pointers are usually short-lived (e.g. load resource in background thread, do a parallel computation with a result, etc).

Bromeon avatar Aug 24 '23 13:08 Bromeon

Just gonna summarize general thread safety issues for the various types. When we focus more on thread safety, we'll likely need to split all the thread unsafe types in two to create a thread-safe variant for each. But there are several types which likely dont need such a split, and some types may only require slight modifications to be made safe.

Known thread safe

  • All the mathy types are implemented in rust and thus are entirely thread-safe
  • integers, floats, and bool

Likely thread safe

  • StringName is immutable + atomically refcounted and should therefore be thread safe
  • Packed*Array except String, these should be largely COW types so can't be mutated concurrently from different threads
  • Gd<T> where T is a godot-singleton, it may be that our assumptions in making Gd makes this thread unsafe but usage of the godot singletons is thread safe

Likely thread unsafe

String types can be easily cloned into a shared reference and then mutated afterwards, this is likely gonna cause concurrency issues.

  • GString is mutable and thus likely not thread safe
  • NodePath is mutable and thus likely not thread safe
  • PackedStringArray, GString is likely not thread safe so this likely isn't either

Known thread unsafe

  • Array, allows arbitrary sending and sharing of other godot types, even Array<i64> cannot in a thread-safe manner always maintain its safety invariant
  • Dictionary, allows arbitrary sending and sharing of other godot types
  • Variant, it can be any of the other types
  • Gd<T>, intentionally thread unsafe, will likely need a different type for thread safe object access - see above
  • Callable, arbitrary code may be run
  • Signal, arbitrary code may be run

lilizoey avatar Feb 17 '24 17:02 lilizoey

This is a great overview! 👍

If we want to provide ergonomic usage, we need to think if/how we can support common use cases such as:

  • load resources on a separate thread
  • construct a node on a separate thread, send it to main, attach to tree there

Some observations:

  1. Certain methods like Object::call, Object::get, Node::call_group etc. must be excluded from thread-safety guarantees, since those can invoke arbitrary code (Rust and non-Rust).
    • This is already a problem in regular Gd<T>.
  2. Some node interactions must be run on main, e.g. scene-tree related ones like Node::add_child, Node::add_sibling, Node::reparent.
    • I don't know how to verify this unless we annotate each method manually, which I'd like to avoid if somehow possible.
  3. Other options like Node3D::set_position (most properties in general) are fine to execute on any thread, as long as there is no concurrent access to the same object from another thread.
    • Unique objects (e.g. newly created ones) are always safe to use.
    • Shared objects obey the borrow principle: 1 writer XOR many readers.
    • Node is not reference-counted, so it's impossible to know how many references exist.
    • However, at least from Rust side we could provide best-effort diagnostics in Debug mode: method calls check a global lock for that instance ID, and panic if the lock is being held. This can't be done in GDScript code, but we need to live with that being outside our guarantees.

As for unsafe usage, we need to see where it makes sense -- if every single cross-thread method becomes unsafe, we are back to gdnative's approach and don't win much. I'm particularly considering how to map Send (allow moving but not sharing) and Sync (allow shared access) to the above-mentioned semantics.

Bromeon avatar Feb 18 '24 13:02 Bromeon

One thing we could do is have a Gd type with Box-like semantics, similar to gdnative's Unique typestate. Though a different type entirely instead of typestates may be more appropriate.

Its safety invariant is that it has unique ownership of the object. This would let us much more easily use it safely from different threads, it could be Send and maybe even Sync. Since we know there is only one reference to this object, then it should be entirely safe to use the vast majority of methods on it. It may be a bit difficult to figure out what methods could break the safety invariant though. We could then have a way to safely convert it to a normal Gd (consuming the original object). This would let us support patterns like "do things on other threads, send to main thread, then use like normal".

lilizoey avatar Feb 20 '24 17:02 lilizoey

We could also start very conservatively, and limit the way how this UniqueGd (or however it's called) can be used:

  • ✔️ any Deref/DerefMut API methods on it (otherwise it's useless).
    • (We could potentially mark some unsafe if really appropriate, but this won't scale well)
  • ✔️ constructors new_alloc, new_gd, from_init_fn, from_object
  • cast, try_cast (consumes object), upcast, upcast_ref, upcast_mut (basically Deref)
  • ✔️ free (consumes object)
  • ✔️ instance_id
  • ✔️ bind, bind_mut
  • ✔️ into_gd (consumes object)
  • ✔️ pass to Godot APIs by consuming the object

These operations are absent:

  • clone(), obviously
  • from_godot and from_variant
  • to_godot and to_variant

These operations could be marked unsafe (or also absent):

  • ✖️ from_gd() or from_gd_unchecked() for manually managed types
    • could involve best-effort diagnostics in debug mode, but quite involved
  • ⚠️ from_gd() for ref-counted types, which is dynamically checked and safe
  • ✖️ from_instance_id()
    • can initially also be absent, as this may not even be needed; can also be trivially achieved with a from_gd constructor
  • ✖️ pass to Godot APIs but via reference

Bromeon avatar Feb 20 '24 19:02 Bromeon

We could also start very conservatively, and limit the way how this UniqueGd (or however it's called) can be used:

* ✔️ any `Deref`/`DerefMut` API methods on it (otherwise it's useless).

  * (We could potentially mark some unsafe if really appropriate, but this won't scale well)

We do need to double check some of this, Deref is almost certainly fine. But DerefMut might have some issues where we could for instance do:

let mut foo = UniqueGd::<Node>::new_alloc();
let bar = Gd::<Node>::new_alloc();
foo.add_child(bar.clone());

send_to_main_thread(foo); // Also sends `bar` to the main thread?

bar.some_function(); // And we have a reference to `bar` on both threads?

Maybe if there was some way to limit the internal APIs such that functions called through DerefMut for UniqueGd only accepts UniqueGds...

* ✔️ `into_gd` (consumes object)

* ✔️ pass to Godot APIs by consuming the object

Both of these could maybe just be a From<UniqueGd> for Gd impl.

lilizoey avatar Feb 21 '24 16:02 lilizoey

We do need to double check some of this, Deref is almost certainly fine. But DerefMut might have some issues where we could for instance do:

Good observation -- I checked Node and it seems like most "dangerous" methods actually take &mut self. We'll possibly find one or another outlier, but this might greatly help at keeping the &self methods ergonomic 🙂


Maybe if there was some way to limit the internal APIs such that functions called through DerefMut for UniqueGd only accepts UniqueGds...

We could maybe reuse the trick from std::thread::spawn() and require a closure with a Send bound?

So mutable methods could not be implicitly called through DerefMut, but would need something like

unique.with_mut(move |this| {
    this.add_child(other_unique);
});

Bromeon avatar Feb 21 '24 17:02 Bromeon

So i experimented a bit with making a UniqueGd, and there are some extra issues that come up

Which classes can be UniqueGd?

Any engine classes can, but what about rust user classes? If the user class doesn't store the Base field and uses the default init implementation, then this should be safe. But what if you want Base access?

If you can access Base then you can just do self.to_gd() at which point you've aliased the UniqueGd, which would be library UB.

So we'd need to prevent people from doing things like this, maybe a UniqueBase is needed? But that adds further new types, maybe it'd be possible to add something to Bounds which says whether Self can be UniqueGd or not, which lets us then determine what functionality Base exposes.

But then users need to opt-in to being usable in Unique which is maybe not ideal? Or maybe it is, it's nice to be explicit sometimes.

Mutable references

As soon as you have a &mut T reference to the base object, it's possible to use std::mem::swap to break the invariant. Simply do something like this:

let unique = UniqueGd::new();
unique.with_mut(|obj| {
    let mut new_gd = Object::new_alloc();
    std::mem::swap(&mut *obj, &mut *new_gd);
    let cloned = new_gd.clone();
    std::mem::swap(&mut *obj, &mut *new_gd);
    // Woops, now `cloned` aliases `unique`, not ok
});

This is essentially #23 again, however with UniqueGd this would actually be unsound. Requiring us to either make any mutable access to such an object unsafe, or to make some bigger changes to actually fix #23, such as the supertrait method.

lilizoey avatar Jun 13 '24 01:06 lilizoey

Great observations!

If you can access Base then you can just do self.to_gd() at which point you've aliased the UniqueGd, which would be library UB.

So we'd need to prevent people from doing things like this, maybe a UniqueBase is needed? But that adds further new types, maybe it'd be possible to add something to Bounds which says whether Self can be UniqueGd or not, which lets us then determine what functionality Base exposes.

But then users need to opt-in to being usable in Unique which is maybe not ideal? Or maybe it is, it's nice to be explicit sometimes.

I think we need to strike a balance between static type safety (and the messy APIs it can cause) and ergonomics. I'm fine with panicking in such cases, a descriptive message could tell the user immediately what's wrong -- often even more so than weird bound mismatches. This would also follow ideas here.

As soon as you have a &mut T reference to the base object, it's possible to use std::mem::swap to break the invariant.

Hm, that indeed looks like a hard problem. Once you have &mut access to the object, one can always move in/out...

Spontanous idea that I haven't tested would be to just provide a temporary proxy (bound to scope) that forwards method calls to the original object, or something that doesn't deref-mut to Object directly. But that would also need either a fundamental redesign or at least a duplication of the API methods 🤔

Bromeon avatar Jun 13 '24 11:06 Bromeon

If we want uniqueness (or general multi-threading), we will likely need runtime information.

In my comment above, I mentioned that UniqueGd::from_instance_id() could be unsafe/absent, but this alone doesn't solve the problem. For example, what prevents one from doing:

let unique: UniqueGd<Node> = ...;
let id = unique.instance_id();

let secret_clone: Gd<Node> = Gd::from_instance_id(id); // regular Gd!

Making UniqueGd::instance_id() unsafe (as it can cause follow-up UB) isn't a great approach. There are other backdoors like Debug which can be parsed for the ID, etc...

And it doesn't solve a problem that doesn't even involve UniqueGd:

let one: Gd<Node> = ...;
let id = one.instance_id();
thread::spawn(|| {
    let two: Gd<Node> = Gd::from_instance_id(id);
});

On the other hand, we could add meta-information (Object::set_meta) that is verified by Gd::from_instance_id() and causes a panic. If we don't trust the Godot meta system because it's user-accessible, we can also keep our own

static UNIQUE_OBJECTS: Global<HashSet<InstanceId>>

In both cases, we need to trace all UniqueGd -> Gd conversions, to remove the "unique" meta flag. If we have our own hash set, we additionally need to trace all destructions, to avoid an ever-growing set.

Bromeon avatar Jun 14 '24 10:06 Bromeon

One thing to note is that these methods also exist separately on godots end, and we can't always guarantee that the godot versions of these aren't called by users since there's always Object::call.

lilizoey avatar Jun 14 '24 18:06 lilizoey