bytestring append is a bit inefficient when bytestrings are short

append unconditionally uses FFI calls to memcpy for both bytestrings. If one or both of them are known at compile time to be very short, this is inefficient. I believe there's a very simple solution: just use GHC primops to do the copying instead of FFI. When there's a known short length to copy, instructions will be emitted to copy directly; otherwise the compiler will use an FFI call. No need to do this so manually.

Jan 07 '22 22:01 treeowl

Thanks for the tip! Which is the primop that we should use?

Jan 07 '22 23:01 sjakobi

copyAddrToByteArray# looks like the right one to me. Note that this is only optimized when the number of bytes to be copied ends up being a CMM literal. I don't know just what it takes to make sure that happens, but it's a much better bet than not even trying, and allows any potential future improvements in GHC's implementation to help further.

Jan 07 '22 23:01 treeowl

I'm not sure how copyAddrToByteArray# would help. There is no immediate ByteArray here.

Jan 07 '22 23:01 Bodigrim

@Bodigrim It requires a bit of extra work, pulling in GHC.ForeignPtr. Instead of using the stock mallocPlainForeignPtrBytes, we can build the ForeignPtr ourselves, filling a ByteArray# and then wrapping it up. Rough sketch:

blargh :: Int -- ^ size of string to create
  -> (MBA RealWorld -> IO ())  -- ^ how to mutate it
  -> IO (ForeignPtr a)
blargh (I# size) builder = IO $ \s ->
    case newPinnedByteArray# size s     of { (# s', mbarr# #) ->
    case unIO (builder (MBA mbarr#)) s' of { (# s'', ~() #) ->
       (# s'', ForeignPtr (byteArrayContents# (unsafeCoerce# mbarr#))
                         (PlainPtr mbarr#) #)
     }}
{-# INLINE blargh #-}

The argument to blargh is the copying code, which has access to the MutableByteArray#.

Jan 07 '22 23:01 treeowl

Yeah, but I'm not convinced that an overhead for allocating redundant ByteArray header does not absorb potential savings from unrolled copyAddrToByteArray#.

Jan 07 '22 23:01 Bodigrim

@Bodigrim what redundant ByteArray header? The MBA wrapper will inline away. If you don't trust that, you can use MutableByteArray# RealWorld -> IO () instead, but I'm pretty sure that's not necessary.

Jan 07 '22 23:01 treeowl

The point is not the exact details; it's that if we crack open GHC.ForeignPtr we can get our hands on the MutableByteArray# before it's wrapped in its ForeignPtr and have our way with it.

Side note: it is odd that I can't find a copyAddrToAddr# primop. Does it go by another name, or does it really not exist?

Jan 07 '22 23:01 treeowl

Sorry, I meant ByteArray# header, which is memory until byteArrayContents#: header and array size, already 16 bytes.

Jan 08 '22 00:01 Bodigrim

OK, but that's not extra. We get that already by calling mallocPlainForeignPtrBytes! My blargh is just mallocPlainForeignPtrBytes with a callback. I also removed a safety check, but that probably should actually remain in case of enormous arguments whose sum overflows.

Jan 08 '22 00:01 treeowl

I think the best-case-scenario with respect to this issue would be for bytestring to eventually switch from using memcpy directly to using copyBytes and for the latter to have improved handling of known-small-copies, perhaps via the addition of a new primop. Please feel free to request this functionality upstream!

If/when this situation improves, it may be of value to modify unsafeCreate to make the length of the ByteStrings it generates available at compile-time and not obscured by the unsafeDupablePerformIO.

Also, the discussion about the MutableByteArray# headers above has brought to my attention that the memory use numbers in the Haddocks for Data.ByteString.Short weren't updated with the removal of the offset parameter and we are now at 4+2+2=8 words of overhead for an unshared StrictByteString.

Jan 09 '22 01:01 clyring

@clyring , I think it's better to do what I did anyway, since it doesn't need keepAlive# and since we need to hang on to that MutableByteArray# anyway to make unboxing work out. There's no obvious benefit to using Addr# here. We definitely do want primop support for Addr# to Addr# copying, but I don't know that bytestring can really win anything by using it.

Jan 09 '22 01:01 treeowl

There is an obvious benefit to not to fiddling with the MutableByteArray#s ourselves: shorter, simpler code. The keepAlive# matter is a separate issue (#461) and is mostly orthogonal to this one. Or do I misunderstand you?

Jan 09 '22 02:01 clyring

@clyring, see #466 to expose lots of good info from unsafeCreate (you'll have to compile with GHC 9.2 or so to get the good effects, thanks to keepAlive# and so on). I'm also starting to think we should probably offer MutableArray#-based versions of create, unsafeCreate, etc., since it's so much easier to avoid even having to think about liveness in that context. For reliable unboxing, I think we'll want something like the PlainByteString type I use there.

Jan 09 '22 02:01 treeowl

@clyring I think it's okay to fiddle with MutableByteArray#s ourselves, as long as we can keep them reasonably well contained. I think #466 does a pretty good job of starting that containment; I'm not sure how far we can go. I think we want to minimize boxing to the extent possible. Given f (g x), we really don't want f to build a ForeignPtrContents ~~, a ForeignPtr,~~ and a ByteString only to have those ~~three~~ two (!) heap objects immediately torn down by g. That's just lousy, and we can avoid an awful lot of it with some care. The big question is how pretty we can/can't make the building blocks for that.

Jan 09 '22 02:01 treeowl

Fixed by #591.

Jun 16 '23 15:06 clyring