Make string interner and backends compatible with more types
This PR rewrites the interner to be generic, making it compatible with many more types.
- ~~Now the interner explicitly needs a backend parameter, because the type checker cannot resolve
&stras the input argument tostrbased backends. (There are more than oneimpl AsReffor the&strtype). One solution would be to restrict T toBorrow, with the downside of making it less ergonomic to pass a&[String]to theextendmethod of the interner. I'd like to hear your ideas :)~~ - Had to add a couple of custom trait bounds, one of them being unsafe by the nature of the
BucketBackend. - Some bounds are overly restrictive, like using
<S as ToOwned>::Ownedas the buffer type of theStringBackend, but that made the API a bit more cleaner with less generic parameters. Would it be preferrable as it is or do you prefer more general types?
Comments are very much appreciated :)
Hi and thanks for your PR! Looks interesting! Usually for bigger changes it is nice to have an issue up before a PR emerges to allow for discussion beforehand.
I have yet to grasp what it actually provides users of this library in essence. So will it be possible to use C-style strings or OsString with StringInterner or is this even more generic? Some motivation where you describe the state of the string-interner crate, its limitations and problems and what this PR fixes (and evtl. how) would be very much approeciated.
I also conducted the compiled cargo doc of this PR and saw some new types. Most notably Span in backend::string and backend::bucket and the FixedString and FixedVec in backend::bucket. Would be nice if you could describe why this PR requires us to expand the API of the crate.
I also conducted some benchmarks using cargo bench and unfortunately found that this PR worsens the performance of the bucket backend and string backend significantly in some cases.
Thank you for the quick response!
Usually for bigger changes it is nice to have an issue up before a PR emerges to allow for discussion beforehand.
Noted for the next time!
I have yet to grasp what it actually provides users of this library in essence. So will it be possible to use C-style strings or
OsStringwithStringInterneror is this even more generic? Some motivation where you describe the state of thestring-internercrate, its limitations and problems and what this PR fixes (and evtl. how) would be very much appreciated.
In Boa, a Javascript interpreter, we use string-interner to intern string literals embedded in the code. However, we're transitioning to an UTF-16 based string and this interner is currently unable to intern arbitrary [T] slices. (Specifically we need support for [u16] slices)
I thought of just providing a [u16] implementation PR and calling it a day, but since I saw that in #39 you had a discussion about allowing another string type, I just thought that it would be ideal to make the interner and the backends generic to any custom string.
This PR provides full capabilities for using the interner with any custom type, provided that the user implements the correct traits for its custom type. This also makes it very easy to extend the interner to be compatible with OsStr, CStr and some others, you just implement the required traits and that's it!.
I also conducted the compiled
cargo docof this PR and saw some new types. Most notablySpaninbackend::stringandbackend::bucketand theFixedStringandFixedVecinbackend::bucket. Would be nice if you could describe why this PR requires us to expand the API of the crate.
IIRC Span was already in backend::String, I just reordered a bit the exports because of the new trait associated with the BufferBackend, and to have symmetry with each backend.
The most notable additions to the API are FixedContainer, FixedVec and Sliced.
FixedContainer is a generalization of FixedString, a container fixed in capacity that doesn't reallocate on each push. Had to make it unsafe because there's currently no way to know whether a reallocation of the container occurred, so it must be responsibility of the user to guarantee that invariant.
FixedString already existed, just exposed it to be explicitly passable as a generic parameter for BufferBackend.
FixedVec is the one that allows the user to utilize the backend with any [T] slice, you just need to provide the type of the slice. It's pretty similar to FixedString, but without the cast from [u8] to str. Of course, the user could have implemented FixedVec by its own, but I think providing common containers for some std types would be better for ergonomics.
Also, I realized the trait could be implemented for the raw Vec and String types, but I left it as it is until I heard your thoughts about it.
Finally, Sliced is just a trait to represent a type T that is bidirectionally convertible to a [U] slice. This bound was needed to generalize the StringBackend and the BufferBackend. It isn't marked as unsafe because the implementor can assume that the provided input &[U] is a valid representation of T, since the backends only convert full strings between each representation. Implemented it for str and [T] and should be very easy to implement for more std string types.
I also conducted some benchmarks using
cargo benchand unfortunately found that this PR worsens the performance of thebucketbackend andstringbackend significantly in some cases.
That's... weird. Maybe the generics are adding additional indirection to the trait methods? Probably inlining should fix that. May I know specifically which benchmarks have regressed?
Thanks for your elaborate explanations!
That's... weird. Maybe the generics are adding additional indirection to the trait methods? Probably inlining should fix that. May I know specifically which benchmarks have regressed?
I think it is best if you just run the benchmarks on your own computer to get a feeling. Just checkout master branch, run cargo bench, then checkout your branch rerun cargo bench and see the difference immediate in your console.
The regressions on my system were quite significant in the range of 50%-100%. Maybe you are right that it can be fixed with inline.
With respect to the suggested API I am always looking for the simplest design. Yet, I think having all this flexibility with Sliced, FixedVec and exposing internal types such as Span won't get us there. The ideal API was to just introduce a single generic parameter to the Backend trait with a default to String (or str depending on the design) and maybe some trait that defines the behavior that must be fulfilled by such a String type that can be used by the string-interner crate. That's at least how I imagine the simplest API for this feature.
I think it is best if you just run the benchmarks on your own computer to get a feeling. Just checkout
masterbranch, runcargo bench, then checkout your branch reruncargo benchand see the difference immediate in your console. The regressions on my system were quite significant in the range of 50%-100%. Maybe you are right that it can be fixed withinline.
Will do!
... and exposing internal types such as
Span...
Sorry! I Exposed Span by accident when I marked the whole module as pub. Will remove the keyword from the declaration.
With respect to the suggested API I am always looking for the simplest design. Yet, I think having all this flexibility with
Sliced,FixedVecand exposing internal types such asSpanwon't get us there. The ideal API was to just introduce a single generic parameter to theBackendtrait with a default toString(orstrdepending on the design) and maybe some trait that defines the behavior that must be fulfilled by such aStringtype that can be used by thestring-internercrate. That's at least how I imagine the simplest API for this feature.
Ah, so unify the requirements of all backends into a single trait? I could do that for most of the backends.
However, buffer is another beast, because it requires handling specific invariants for its containers. I CAN make it require <S as ToOwned>::Owned: FixedContainer, which will make it able to pass str as a parameter, but I'd need to delete the safer FixedString for a direct implementation of the trait for String. I think that's okay, but I'd like to hear what you think.
If that's the plan, then the exposed types are a lot less; only FixedContainer and a new Internable trait encompassing all the requirements for all the backends. Is that okay?
Also, should I add the <S as ToOwned>::Owned: FixedContainer bound for the new Internable trait? or should I bound it only in the BufferBackend? I think the separation would be good, because implementing Internable is safe, while implementing FixedContainer is inherently unsafe.
On another note, I had to adjust the allocation benchmarks for the BucketBackend, because calling shrink_to_fit results in undefined behaviour. The cause was the definition of shrink_to_fit:
https://github.com/Robbepop/string-interner/blob/d552e760dae3fa1a2c3cb9a00c9232c218d0f58a/src/backend/bucket/mod.rs#L119-L123
Calling self.head.shrink_to_fit was reallocating head and invalidating previously created NonNulls, so I removed that call:
https://github.com/Robbepop/string-interner/blob/30f6186a69d8352ced63f7645b31f618f8666b81/src/backend/bucket/mod.rs#L152-L155
And obviously, fixing the bug increased the memory consumption, so I had to adjust the expected min and max memory used.
@Robbepop did you have the chance to check the latest changes?
@Robbepop did you have the chance to check the latest changes?
Sure but what does @jedel1043 think about the state of this PR?
If that's the plan, then the exposed types are a lot less; only FixedContainer and a new Internable trait encompassing all the requirements for all the backends. Is that okay?
I'd like for FixedContainer not to be exposed. What is the deal with it anyways?
The Internable trait should actually be called String since I suppose that it is implemented by string types in order to fit into the scheme.
I hope you can understand that I won't merge this if it introduces performance or memory usage regressions since I believe it is very well possible to not introduce those while making the StringInterner work for more string types.
On another note, I had to adjust the allocation benchmarks for the BucketBackend, because calling shrink_to_fit results in undefined behaviour. The cause was the definition of shrink_to_fit:
Can you elaborate on this? I don't see how this results in undefined behavior atm. The memory consumption regressions for the BucketBackend are extreme.
Can you elaborate on this? I don't see how this results in undefined behavior atm. The memory consumption regressions for the
BucketBackendare extreme.
Sure :)
The definition of the BucketBackend is as follows:
https://github.com/Robbepop/string-interner/blob/d552e760dae3fa1a2c3cb9a00c9232c218d0f58a/src/backend/bucket/mod.rs#L55-L60
This backend works by creating several FixedString buffers of a fixed size, storing them in a Vec<String> and creating NonNull<str> references (abstracted by InternedString) to each of them in order to obtain the interned strs.
However, FixedString must always ensure that it won't be moved to another memory location, otherwise the previously created NonNull<str> references will point to unallocated memory (use after free).
Unfortunately, when we call the shrink_to_fit method from the backend, it internally calls the shrink_to_fit method from FixedString:
https://github.com/Robbepop/string-interner/blob/d552e760dae3fa1a2c3cb9a00c9232c218d0f58a/src/backend/bucket/fixed_str.rs#L60-L63
And calling shrink_to_fit on a String CAN move its contents to another memory location to reduce its allocated space. From the definition of Vec::shrink_to_fit:
pub fn shrink_to_fit(&mut self) {
// The capacity is never less than the length, and there's nothing to do when
// they are equal, so we can avoid the panic case in `RawVec::shrink_to_fit`
// by only calling it with a greater capacity.
if self.capacity() > self.len {
self.buf.shrink_to_fit(self.len);
}
}
This calls RawVec::shrink_to_fit:
pub fn shrink_to_fit(&mut self, cap: usize) {
handle_reserve(self.shrink(cap));
}
Which calls RawVec::shrink:
fn shrink(&mut self, cap: usize) -> Result<(), TryReserveError> {
assert!(cap <= self.capacity(), "Tried to shrink to a larger capacity");
let (ptr, layout) = if let Some(mem) = self.current_memory() { mem } else { return Ok(()) };
let ptr = unsafe {
// `Layout::array` cannot overflow here because it would have
// overflowed earlier when capacity was larger.
let new_layout = Layout::array::<T>(cap).unwrap_unchecked();
self.alloc
.shrink(ptr, layout, new_layout)
.map_err(|_| AllocError { layout: new_layout, non_exhaustive: () })?
};
self.set_ptr_and_cap(ptr, cap);
Ok(())
}
}
Which calls Allocator::shrink, and in the implementation of shrink for Global, the default global allocator, in the branch corresponding to the case where new_layout.size() < old_layout.size():
new_size => unsafe {
let new_ptr = self.allocate(new_layout)?;
ptr::copy_nonoverlapping(ptr.as_ptr(), new_ptr.as_mut_ptr(), new_size);
self.deallocate(ptr, old_layout);
Ok(new_ptr)
},
Oops, the old memory is deallocated! This causes a use after free and results in undefined behaviour.
I'd like for
FixedContainernot to be exposed.
I don't think that's possible. You cannot require that a generic type ought to be an implementor of a certain trait without exposing said trait. The alternative would be to maintain a different definition of each backend for each string type, which is not ideal.
The
Internabletrait should actually be calledStringsince I suppose that it is implemented by string types in order to fit into the scheme.
I don't think is a good idea to name a trait as a specific type of the standard library. Aside from the name conflict, it would be very confusing if a trait, which is implemented by types, has the same name as a specific type. What about InternableString?
@Robbepop Did you have the chance to read my explanation and comments? Also, if you're still unsure about this PR, I could reduce its complexity by completely removing the ability to use the interner with custom user types, making it only able to pass a certain list of types (mainly str, [T], CStr, Path, OStr).
I think I would like to keep the StringInterner implementation as simple as possible given that it had some UB findings in the past and I want to avoid to repeat that. If we are able to find a generic implementation that is kind of simple enough I am happy to re-evaluate this.