tokenizers icon indicating copy to clipboard operation
tokenizers copied to clipboard

`RefMutContainer` is unsound

Open CheaterCodes opened this issue 1 year ago • 3 comments

Since I was made aware of an unsoundness in this code base today (see #1485), I decided to have a bit of a look around.

I found that RefMutContainer erases lifetimes, resulting in more unsafe code. Source location: https://github.com/huggingface/tokenizers/blob/14a07b06e4a8bd8f80d884419ae4630f5a3d8098/bindings/python/src/utils/mod.rs#L44-L66

Note that I didn't check if there is actually any unsoundness here. However, since the safety here can't be locally guaranteed, the methods should at least be marked as unsafe, but even better, fixed to be correct.

Here is an example code that shows undefined behavior when checked with Miri:

let container = {
    let mut content = String::from("foo");
    RefMutContainer::new(&mut content)
};

container.map(|text| {
  // Triggers UB
  println!("{text}");
});

Link to playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=fede2ac43a1fdb8b5e499b505e9ecdea

Error produced by miri:

error: Undefined Behavior: out-of-bounds pointer use: alloc1256 has been freed, so this pointer is dangling
   --> /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/slice/raw.rs:138:9
    |
138 |         &*ptr::slice_from_raw_parts(data, len)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ out-of-bounds pointer use: alloc1256 has been freed, so this pointer is dangling
    |
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
help: alloc1256 was allocated here:
   --> src/main.rs:29:20
    |
29  |     let mut text = String::from("foo");
    |                    ^^^^^^^^^^^^^^^^^^^
help: alloc1256 was deallocated here:
   --> src/main.rs:31:5
    |
31  |     drop(text);
    |     ^^^^^^^^^^
    = note: BACKTRACE (of the first span):
    = note: inside `std::slice::from_raw_parts::<'_, u8>` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/slice/raw.rs:138:9: 138:47
    = note: inside `<std::vec::Vec<u8> as std::ops::Deref>::deref` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/mod.rs:2831:18: 2831:64
    = note: inside `<std::string::String as std::ops::Deref>::deref` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/string.rs:2487:43: 2487:52
    = note: inside `<std::string::String as std::fmt::Display>::fmt` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/string.rs:2376:28: 2376:34
    = note: inside `<&std::string::String as std::fmt::Display>::fmt` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/fmt/mod.rs:2377:62: 2377:82
    = note: inside `core::fmt::rt::Argument::<'_>::fmt` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/fmt/rt.rs:173:76: 173:95
    = note: inside `std::fmt::write` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/fmt/mod.rs:1178:21: 1178:44
    = note: inside `<std::io::StdoutLock<'_> as std::io::Write>::write_fmt` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/io/mod.rs:1823:15: 1823:43
    = note: inside `<&std::io::Stdout as std::io::Write>::write_fmt` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/io/stdio.rs:786:9: 786:36
    = note: inside `<std::io::Stdout as std::io::Write>::write_fmt` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/io/stdio.rs:760:9: 760:33
    = note: inside `std::io::stdio::print_to::<std::io::Stdout>` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/io/stdio.rs:1116:21: 1116:47
    = note: inside `std::io::_print` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/io/stdio.rs:1226:5: 1226:37
note: inside closure
   --> src/main.rs:35:7
    |
35  |       println!("{text}");
    |       ^^^^^^^^^^^^^^^^^^
note: inside `RefMutContainer::<std::string::String>::map::<{closure@src/main.rs:33:19: 33:25}, ()>`
   --> src/main.rs:18:14
    |
18  |         Some(f(unsafe { ptr.as_ref().unwrap() }))
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `main`
   --> src/main.rs:33:5
    |
33  | /     container.map(|text| {
34  | |       // Triggers UB
35  | |       println!("{text}");
36  | |     });
    | |______^
    = note: this error originates in the macro `println` (in Nightly builds, run with -Z macro-backtrace for more info)

note: some details are omitted, run with `MIRIFLAGS=-Zmiri-backtrace=full` for a verbose backtrace

Edit: Some updates for clarity

CheaterCodes avatar Aug 13 '24 17:08 CheaterCodes

Similar reproduction as above, but just showing that it can easily be even more innocent than an explicit drop

let container = {
    let mut text = String::from("foo");
    RefMutContainer::new(&mut text)
};

MolotovCherry avatar Aug 13 '24 17:08 MolotovCherry

Similar reproduction as above, but just showing that it can easily be even more innocent than an explicit drop

let container = {
    let mut text = String::from("foo");
    RefMutContainer::new(&mut text)
};

Haha, race condition! I just updated it with your suggestion :)

CheaterCodes avatar Aug 13 '24 17:08 CheaterCodes

Looking at this a bit closer, I can't think of any situation where this isn't

  • horribly unsound or
  • couldn't just use references internally.

This struct should either

  • use raw pointers throughout and never dereference them
  • store a &mut T internally instead of a *mut T

CheaterCodes avatar Aug 13 '24 17:08 CheaterCodes