string-interner Make string interner and backends compatible with more types

This PR rewrites the interner to be generic, making it compatible with many more types.

~~Now the interner explicitly needs a backend parameter, because the type checker cannot resolve &str as the input argument to str based backends. (There are more than one impl AsRef for the &str type). One solution would be to restrict T to Borrow, with the downside of making it less ergonomic to pass a &[String] to the extend method of the interner. I'd like to hear your ideas :)~~
Had to add a couple of custom trait bounds, one of them being unsafe by the nature of the BucketBackend.
Some bounds are overly restrictive, like using <S as ToOwned>::Owned as the buffer type of the StringBackend, but that made the API a bit more cleaner with less generic parameters. Would it be preferrable as it is or do you prefer more general types?

Comments are very much appreciated :)

Feb 21 '22 07:02 jedel1043

Hi and thanks for your PR! Looks interesting! Usually for bigger changes it is nice to have an issue up before a PR emerges to allow for discussion beforehand.

I have yet to grasp what it actually provides users of this library in essence. So will it be possible to use C-style strings or OsString with StringInterner or is this even more generic? Some motivation where you describe the state of the string-interner crate, its limitations and problems and what this PR fixes (and evtl. how) would be very much approeciated.

I also conducted the compiled cargo doc of this PR and saw some new types. Most notably Span in backend::string and backend::bucket and the FixedString and FixedVec in backend::bucket. Would be nice if you could describe why this PR requires us to expand the API of the crate.

I also conducted some benchmarks using cargo bench and unfortunately found that this PR worsens the performance of the bucket backend and string backend significantly in some cases.

Feb 21 '22 16:02 Robbepop

Thank you for the quick response!

Usually for bigger changes it is nice to have an issue up before a PR emerges to allow for discussion beforehand.

Noted for the next time!

I have yet to grasp what it actually provides users of this library in essence. So will it be possible to use C-style strings or OsString with StringInterner or is this even more generic? Some motivation where you describe the state of the string-interner crate, its limitations and problems and what this PR fixes (and evtl. how) would be very much appreciated.

In Boa, a Javascript interpreter, we use string-interner to intern string literals embedded in the code. However, we're transitioning to an UTF-16 based string and this interner is currently unable to intern arbitrary [T] slices. (Specifically we need support for [u16] slices) I thought of just providing a [u16] implementation PR and calling it a day, but since I saw that in #39 you had a discussion about allowing another string type, I just thought that it would be ideal to make the interner and the backends generic to any custom string. This PR provides full capabilities for using the interner with any custom type, provided that the user implements the correct traits for its custom type. This also makes it very easy to extend the interner to be compatible with OsStr, CStr and some others, you just implement the required traits and that's it!.

I also conducted the compiled cargo doc of this PR and saw some new types. Most notably Span in backend::string and backend::bucket and the FixedString and FixedVec in backend::bucket. Would be nice if you could describe why this PR requires us to expand the API of the crate.

IIRC Span was already in backend::String, I just reordered a bit the exports because of the new trait associated with the BufferBackend, and to have symmetry with each backend.

The most notable additions to the API are FixedContainer, FixedVec and Sliced.

FixedContainer is a generalization of FixedString, a container fixed in capacity that doesn't reallocate on each push. Had to make it unsafe because there's currently no way to know whether a reallocation of the container occurred, so it must be responsibility of the user to guarantee that invariant.

FixedString already existed, just exposed it to be explicitly passable as a generic parameter for BufferBackend.

FixedVec is the one that allows the user to utilize the backend with any [T] slice, you just need to provide the type of the slice. It's pretty similar to FixedString, but without the cast from [u8] to str. Of course, the user could have implemented FixedVec by its own, but I think providing common containers for some std types would be better for ergonomics. Also, I realized the trait could be implemented for the raw Vec and String types, but I left it as it is until I heard your thoughts about it.

Finally, Sliced is just a trait to represent a type T that is bidirectionally convertible to a [U] slice. This bound was needed to generalize the StringBackend and the BufferBackend. It isn't marked as unsafe because the implementor can assume that the provided input &[U] is a valid representation of T, since the backends only convert full strings between each representation. Implemented it for str and [T] and should be very easy to implement for more std string types.

I also conducted some benchmarks using cargo bench and unfortunately found that this PR worsens the performance of the bucket backend and string backend significantly in some cases.

That's... weird. Maybe the generics are adding additional indirection to the trait methods? Probably inlining should fix that. May I know specifically which benchmarks have regressed?

Feb 21 '22 18:02 jedel1043

Thanks for your elaborate explanations!

That's... weird. Maybe the generics are adding additional indirection to the trait methods? Probably inlining should fix that. May I know specifically which benchmarks have regressed?

I think it is best if you just run the benchmarks on your own computer to get a feeling. Just checkout master branch, run cargo bench, then checkout your branch rerun cargo bench and see the difference immediate in your console. The regressions on my system were quite significant in the range of 50%-100%. Maybe you are right that it can be fixed with inline.

With respect to the suggested API I am always looking for the simplest design. Yet, I think having all this flexibility with Sliced, FixedVec and exposing internal types such as Span won't get us there. The ideal API was to just introduce a single generic parameter to the Backend trait with a default to String (or str depending on the design) and maybe some trait that defines the behavior that must be fulfilled by such a String type that can be used by the string-interner crate. That's at least how I imagine the simplest API for this feature.

Feb 21 '22 18:02 Robbepop

I think it is best if you just run the benchmarks on your own computer to get a feeling. Just checkout master branch, run cargo bench, then checkout your branch rerun cargo bench and see the difference immediate in your console. The regressions on my system were quite significant in the range of 50%-100%. Maybe you are right that it can be fixed with inline.

Will do!

... and exposing internal types such as Span ...

Sorry! I Exposed Span by accident when I marked the whole module as pub. Will remove the keyword from the declaration.

With respect to the suggested API I am always looking for the simplest design. Yet, I think having all this flexibility with Sliced, FixedVec and exposing internal types such as Span won't get us there. The ideal API was to just introduce a single generic parameter to the Backend trait with a default to String (or str depending on the design) and maybe some trait that defines the behavior that must be fulfilled by such a String type that can be used by the string-interner crate. That's at least how I imagine the simplest API for this feature.

Ah, so unify the requirements of all backends into a single trait? I could do that for most of the backends. However, buffer is another beast, because it requires handling specific invariants for its containers. I CAN make it require <S as ToOwned>::Owned: FixedContainer, which will make it able to pass str as a parameter, but I'd need to delete the safer FixedString for a direct implementation of the trait for String. I think that's okay, but I'd like to hear what you think.

If that's the plan, then the exposed types are a lot less; only FixedContainer and a new Internable trait encompassing all the requirements for all the backends. Is that okay? Also, should I add the <S as ToOwned>::Owned: FixedContainer bound for the new Internable trait? or should I bound it only in the BufferBackend? I think the separation would be good, because implementing Internable is safe, while implementing FixedContainer is inherently unsafe.

Feb 21 '22 19:02 jedel1043

On another note, I had to adjust the allocation benchmarks for the BucketBackend, because calling shrink_to_fit results in undefined behaviour. The cause was the definition of shrink_to_fit:

https://github.com/Robbepop/string-interner/blob/d552e760dae3fa1a2c3cb9a00c9232c218d0f58a/src/backend/bucket/mod.rs#L119-L123

Calling self.head.shrink_to_fit was reallocating head and invalidating previously created NonNulls, so I removed that call:

https://github.com/Robbepop/string-interner/blob/30f6186a69d8352ced63f7645b31f618f8666b81/src/backend/bucket/mod.rs#L152-L155

And obviously, fixing the bug increased the memory consumption, so I had to adjust the expected min and max memory used.

Feb 22 '22 18:02 jedel1043

@Robbepop did you have the chance to check the latest changes?

Mar 18 '22 09:03 Razican

@Robbepop did you have the chance to check the latest changes?

Sure but what does @jedel1043 think about the state of this PR?

If that's the plan, then the exposed types are a lot less; only FixedContainer and a new Internable trait encompassing all the requirements for all the backends. Is that okay?

I'd like for FixedContainer not to be exposed. What is the deal with it anyways? The Internable trait should actually be called String since I suppose that it is implemented by string types in order to fit into the scheme. I hope you can understand that I won't merge this if it introduces performance or memory usage regressions since I believe it is very well possible to not introduce those while making the StringInterner work for more string types.

On another note, I had to adjust the allocation benchmarks for the BucketBackend, because calling shrink_to_fit results in undefined behaviour. The cause was the definition of shrink_to_fit:

Can you elaborate on this? I don't see how this results in undefined behavior atm. The memory consumption regressions for the BucketBackend are extreme.

Mar 19 '22 08:03 Robbepop

Can you elaborate on this? I don't see how this results in undefined behavior atm. The memory consumption regressions for the BucketBackend are extreme.

Sure :) The definition of the BucketBackend is as follows:

https://github.com/Robbepop/string-interner/blob/d552e760dae3fa1a2c3cb9a00c9232c218d0f58a/src/backend/bucket/mod.rs#L55-L60

This backend works by creating several FixedString buffers of a fixed size, storing them in a Vec<String> and creating NonNull<str> references (abstracted by InternedString) to each of them in order to obtain the interned strs.

However, FixedString must always ensure that it won't be moved to another memory location, otherwise the previously created NonNull<str> references will point to unallocated memory (use after free).

Unfortunately, when we call the shrink_to_fit method from the backend, it internally calls the shrink_to_fit method from FixedString:

https://github.com/Robbepop/string-interner/blob/d552e760dae3fa1a2c3cb9a00c9232c218d0f58a/src/backend/bucket/fixed_str.rs#L60-L63

And calling shrink_to_fit on a String CAN move its contents to another memory location to reduce its allocated space. From the definition of Vec::shrink_to_fit:

    pub fn shrink_to_fit(&mut self) {
        // The capacity is never less than the length, and there's nothing to do when
        // they are equal, so we can avoid the panic case in `RawVec::shrink_to_fit`
        // by only calling it with a greater capacity.
        if self.capacity() > self.len {
            self.buf.shrink_to_fit(self.len);
        }
    }

This calls RawVec::shrink_to_fit:

    pub fn shrink_to_fit(&mut self, cap: usize) {
        handle_reserve(self.shrink(cap));
    }

Which calls RawVec::shrink:

    fn shrink(&mut self, cap: usize) -> Result<(), TryReserveError> {
        assert!(cap <= self.capacity(), "Tried to shrink to a larger capacity");


        let (ptr, layout) = if let Some(mem) = self.current_memory() { mem } else { return Ok(()) };


        let ptr = unsafe {
            // `Layout::array` cannot overflow here because it would have
            // overflowed earlier when capacity was larger.
            let new_layout = Layout::array::<T>(cap).unwrap_unchecked();
            self.alloc
                .shrink(ptr, layout, new_layout)
                .map_err(|_| AllocError { layout: new_layout, non_exhaustive: () })?
        };
        self.set_ptr_and_cap(ptr, cap);
        Ok(())
    }
}

Which calls Allocator::shrink, and in the implementation of shrink for Global, the default global allocator, in the branch corresponding to the case where new_layout.size() < old_layout.size():

            new_size => unsafe {
                let new_ptr = self.allocate(new_layout)?;
                ptr::copy_nonoverlapping(ptr.as_ptr(), new_ptr.as_mut_ptr(), new_size);
                self.deallocate(ptr, old_layout);
                Ok(new_ptr)
            },

Oops, the old memory is deallocated! This causes a use after free and results in undefined behaviour.

Mar 19 '22 15:03 jedel1043

I'd like for FixedContainer not to be exposed.

I don't think that's possible. You cannot require that a generic type ought to be an implementor of a certain trait without exposing said trait. The alternative would be to maintain a different definition of each backend for each string type, which is not ideal.

The Internable trait should actually be called String since I suppose that it is implemented by string types in order to fit into the scheme.

I don't think is a good idea to name a trait as a specific type of the standard library. Aside from the name conflict, it would be very confusing if a trait, which is implemented by types, has the same name as a specific type. What about InternableString?

Mar 19 '22 16:03 jedel1043

@Robbepop Did you have the chance to read my explanation and comments? Also, if you're still unsure about this PR, I could reduce its complexity by completely removing the ability to use the interner with custom user types, making it only able to pass a certain list of types (mainly str, [T], CStr, Path, OStr).

Apr 27 '22 06:04 jedel1043

I think I would like to keep the StringInterner implementation as simple as possible given that it had some UB findings in the past and I want to avoid to repeat that. If we are able to find a generic implementation that is kind of simple enough I am happy to re-evaluate this.

May 01 '24 07:05 Robbepop