Speed up Data.ByteString.Short.unpack
Following the discussion from https://github.com/haskell/bytestring/issues/524#issuecomment-1173126905
Before:
$ cabal run -w ghc-9.2.3 --enable-profiling bytestring-bench --ghc-options='-fcheck-prim-bounds -fno-ignore-asserts' -- -p 'unpack' +RTS -s
Up to date
All
ShortByteString
ShortByteString unpack
unpack and look at first 100 elements: OK (0.87s)
55.6 ms ± 4.8 ms, 66 MB allocated, 242 B copied, 181 MB peak memory
unpackLast: OK (0.99s)
63.1 ms ± 5.2 ms, 66 MB allocated, 226 B copied, 181 MB peak memory
All 2 tests passed (1.96s)
2,246,950,896 bytes allocated in the heap
7,568 bytes copied during GC
71,372,544 bytes maximum residency (29 sample(s))
1,339,648 bytes maximum slop
181 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 43 colls, 0 par 1.132s 1.136s 0.0264s 0.0535s
Gen 1 29 colls, 0 par 0.319s 0.320s 0.0110s 0.0535s
INIT time 0.001s ( 0.001s elapsed)
MUT time 0.551s ( 0.554s elapsed)
GC time 1.451s ( 1.457s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 2.003s ( 2.011s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 4,075,230,737 bytes per MUT second
Productivity 27.5% of total user, 27.5% of total elapsed
After
All
ShortByteString
ShortByteString unpack
unpack and look at first 100 elements: OK (0.23s)
1.38 μs ± 103 ns, 6.9 KB allocated, 0 B copied, 47 MB peak memory
unpackLast: OK (0.15s)
22.6 ms ± 2.2 ms, 132 MB allocated, 436 B copied, 47 MB peak memory
All 2 tests passed (0.41s)
2,033,165,224 bytes allocated in the heap
7,984 bytes copied during GC
2,342,464 bytes maximum residency (21 sample(s))
257,800 bytes maximum slop
47 MiB total memory in use (1 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 55 colls, 0 par 0.003s 0.003s 0.0000s 0.0022s
Gen 1 21 colls, 0 par 0.025s 0.025s 0.0012s 0.0028s
INIT time 0.001s ( 0.001s elapsed)
MUT time 0.425s ( 0.428s elapsed)
GC time 0.028s ( 0.028s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.454s ( 0.456s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 4,780,369,992 bytes per MUT second
Productivity 93.8% of total user, 93.8% of total elapsed
I'm not sure what are the benefits of the old implementation. This one seems to be better in every regard and probably fuses better.
Git history reveals that Data.ByteString.Short.Internal.unpackAppend{Chars,Bytes}{Lazy,Strict} were added in 9e2814ff65a51d0684eb1b416fad8bf809685faf, copied from Data.ByteString.Internal.unpackAppend{Chars,Bytes}{Lazy,Strict}.
https://github.com/haskell/bytestring/blob/2b9416dd73121b5b2abd7a4594da249028858e4b/Data/ByteString/Internal.hs#L454-L457
https://github.com/haskell/bytestring/blob/2b9416dd73121b5b2abd7a4594da249028858e4b/Data/ByteString/Internal.hs#L494-L502
This is a reasonable strategy for ByteStrings, because indexing them involves accursedUnutterablePerformIO $ unsafeWithForeignPtr fp, which is a non-negligible cost if performed per each element. The same does not hold for ShortByteString, because indexWord8Array is as cheap as it gets. Thanks for noticing this, @hasufell!
My laptop might not be too reliable about performance tests, but I ran it twice and it seems that the strict foldr got consistently slower: https://gist.github.com/hasufell/33c0736e4fcdd1be4896c08322f191f1
Core:
-- RHS size: {terms: 36, types: 12, coercions: 0, joins: 0/2}
$wunpack :: ByteArray# -> [Word8]
$wunpack
= \ (ww :: ByteArray#) ->
let {
y :: Int#
y = -# (sizeofByteArray# ww) 1# } in
case ># 0# y of {
__DEFAULT ->
letrec {
go3 :: Int# -> [Word8]
go3
= \ (x :: Int#) ->
: (case indexWord8Array# ww x of wild { __DEFAULT -> W8# wild })
(case ==# x y of {
__DEFAULT -> go3 (+# x 1#);
1# -> []
}); } in
go3 0#;
1# -> []
}
-- RHS size: {terms: 6, types: 3, coercions: 0, joins: 0/0}
unpack :: ShortByteString -> [Word8]
unpack
= \ (w :: ShortByteString) -> case w of { SBS ww1 -> $wunpack ww1 }
-- RHS size: {terms: 30, types: 25, coercions: 0, joins: 1/1}
$wfoldr' :: forall {a}. (Word8 -> a -> a) -> a -> ByteArray# -> a
$wfoldr'
= \ (@a) (w :: Word8 -> a -> a) (w1 :: a) (ww :: ByteArray#) ->
joinrec {
go1 :: [Word8] -> (a -> a) -> a -> a
go1 (ds :: [Word8]) (eta :: a -> a) (eta1 :: a)
= case ds of {
[] -> eta eta1;
: y ys ->
jump go1
ys (\ (z :: a) -> case w y z of vx { __DEFAULT -> eta vx }) eta1
}; } in
jump go1 ($wunpack ww) id w1
-- RHS size: {terms: 11, types: 9, coercions: 0, joins: 0/0}
foldr' :: forall a. (Word8 -> a -> a) -> a -> ShortByteString -> a
foldr'
= \ (@a)
(w :: Word8 -> a -> a)
(w1 :: a)
(w2 :: ShortByteString) ->
case w2 of { SBS ww1 -> $wfoldr' w w1 ww1 }
Could you change foldr' so it operates on the ByteArray# directly instead of consuming it via unpack?
I don't find it very surprising that a foldr' that consumes a list isn't efficient.
Could you change
foldr'so it operates on theByteArray#directly instead of consuming it viaunpack?I don't find it very surprising that a
foldr'that consumes a list isn't efficient.
But why is it faster with the previous version?
Comparing the Core between the old and new versions, both for foldr' itself and the foldr' benchmark might reveal the issue.
Here are the cores: https://gist.github.com/hasufell/653dea9cc5ad0f0d6cd35e47442fe35b
Even regular foldr is slower.
I was able to speed them up by writing explicit versions:
foldr :: (Word8 -> a -> a) -> a -> ShortByteString -> a
foldr k v = \sbs ->
let l = length sbs
ba = asBA sbs
w = indexWord8Array ba
go !n | n >= l = v
| otherwise = k (w n) (go (n + 1))
in go 0
-- TODO: could write tail recursive right to left version
foldr' :: (Word8 -> a -> a) -> a -> ShortByteString -> a
foldr' k v = \sbs ->
let l = length sbs
ba = asBA sbs
w = indexWord8Array ba
go !n | n >= l = v
| otherwise = let !acc = go (n + 1)
in k (w n) acc
in go 0
but doesn't that indicate that there's a problem with list fusion?
Ok, so lazy foldr can also be sped up by marking both unpack and unpackBytes with INLINE. That suggests that -fexpose-all-unfoldings is not really enough and that unpack doesn't fuse at all (which seems pretty bad).
That only leaves strict foldr'. No combination of inlining had improved this, so I guess we can go with the manual version here.
That leaves that question whether we have other functions that don't fuse/inline properly as well.
So, here are the benchmark results of the latest patchset, compared with the baseline: https://gist.github.com/hasufell/3710e86fcdfcd5f2c641e7d0a57e30ce
- folds (strict and lazy), filter, unfoldrN are faster
- raw unpack when just forcing the entire result is slower (kinda expected, since we have more thunks now?)
- unpack fuses much better now though (e.g.
last . unpackortake 120 . unpackare much faster) - added a lot of new
INLINEpragmas (they had an effect) - rest of the functions are mostly the same speed
[The existing implementation] is a reasonable strategy for
ByteStrings, because indexing them involvesaccursedUnutterablePerformIO $ unsafeWithForeignPtr fp, which is a non-negligible cost if performed per each element. The same does not hold forShortByteString, becauseindexWord8Arrayis as cheap as it gets. Thanks for noticing this, @hasufell!
The need to use accursedUnutterablePerformIO and unsafeWithForeignPtr does not impose any runtime cost on unsafeIndex. (unsafeWithForeignPtr might have a small cost in the current unpackBytes implementation if prevents the loop entry from being optimized to a tail call. But I haven't checked this.)
The performance characteristics of unpack for ByteString and ShortByteString should be very similar. What's good for one is very likely good for the other. The current implementation's goal is to maximize throughput when a large ByteString is unpacked and no list-fusion is possible. It tries to do this by allocating 99% fewer thunks than a fully lazy implementation would, while still allocating the result in small chunks so that a traversal can benefit from the machine's L1 cache.
If I'm reading the benchmark results @hasufell shared in the previous comment correctly, the proposed implementation takes about 75% longer in this case, which is a rather large regression. One alternative worth considering is adapting the existing fusion mechanism for Data.ByteString.unpack to the other unpack variants in bytestring.
Also, the benchmark results @hasufell has reported for 'unpack and look at a small number of leading elements' here and in #524 seem unbelievably large. Reading 100 bytes, allocating a hundred cons cells, and inspecting a few of them should take on the order of a microsecond, not on the order of ten milliseconds. (And it could be reasonably argued that even starting with 100 bytes is too much.) That's four orders of magnitude! I haven't tried reproducing or investigating yet, but something is very wrong somewhere.
@clyring when I use this implementation:
-- | /O(n)/. Convert a 'ShortByteString' into a list.
unpack :: ShortByteString -> [Word8]
unpack sbs = build (unpackFoldr sbs)
{-# INLINE unpack #-}
--
-- Have unpack fuse with good list consumers
--
unpackFoldr :: ShortByteString -> (Word8 -> a -> a) -> a -> a
unpackFoldr sbs k z = foldr k z sbs
{-# INLINE [0] unpackFoldr #-}
{-# RULES
"ShortByteString unpack-list" [1] forall bs .
unpackFoldr bs (:) [] = unpackBytes bs
#-}
with the existing 100 chunk unpackBytes implementation, I get these benchmark results compared to bytestring master:
$ cabal run bytestring-bench -- -p 'All.ShortByteString.ShortByteString conversions' --baseline out_master.csv
Up to date
All
ShortByteString
ShortByteString conversions
unpack
1: OK (0.24s)
18.6 ns ± 1.3 ns, 15% slower than baseline
2: OK (0.27s)
25.0 ns ± 1.4 ns, 17% slower than baseline
4: OK (0.24s)
41.6 ns ± 2.8 ns, 23% slower than baseline
8: OK (0.38s)
73.8 ns ± 4.8 ns, 24% slower than baseline
16: OK (0.20s)
138 ns ± 11 ns, 18% slower than baseline
32: OK (0.19s)
267 ns ± 26 ns, 14% slower than baseline
64: OK (0.34s)
551 ns ± 31 ns, 12% slower than baseline
128: OK (0.31s)
1.00 μs ± 77 ns
256: OK (0.28s)
1.92 μs ± 104 ns
512: OK (0.29s)
3.75 μs ± 232 ns
1024: OK (0.29s)
7.64 μs ± 365 ns
2048: OK (0.16s)
15.5 μs ± 1.4 μs
4096: OK (0.15s)
31.6 μs ± 2.6 μs
8192: OK (0.15s)
66.7 μs ± 5.4 μs
16384: OK (0.17s)
142 μs ± 14 μs
32768: OK (0.17s)
311 μs ± 24 μs
65536: OK (0.20s)
691 μs ± 67 μs
unpack and get last element: OK (0.36s)
2.58 ms ± 121 μs, 87% faster than baseline
unpack and get first 120 elements: OK (0.27s)
1.74 μs ± 95 ns, 99% faster than baseline
however... "unpack and get last element" from this PR is still magnitudes faster than the version above with unpackFoldr:
All
ShortByteString
ShortByteString conversions
unpack
1: OK (0.27s)
21.9 ns ± 1.5 ns, 30% slower than baseline
2: OK (0.34s)
34.3 ns ± 3.0 ns, 48% slower than baseline
4: OK (0.18s)
63.3 ns ± 5.6 ns, 68% slower than baseline
8: OK (0.30s)
125 ns ± 7.0 ns, 94% slower than baseline
16: OK (0.16s)
250 ns ± 21 ns, 98% slower than baseline
32: OK (0.16s)
466 ns ± 45 ns, 82% slower than baseline
64: OK (0.16s)
896 ns ± 84 ns, 66% slower than baseline
128: OK (0.26s)
1.78 μs ± 90 ns, 81% slower than baseline
256: OK (0.27s)
3.51 μs ± 271 ns, 82% slower than baseline
512: OK (0.27s)
7.10 μs ± 328 ns, 88% slower than baseline
1024: OK (0.51s)
14.1 μs ± 446 ns, 84% slower than baseline
2048: OK (0.15s)
28.7 μs ± 2.7 μs, 86% slower than baseline
4096: OK (0.25s)
57.5 μs ± 2.8 μs, 81% slower than baseline
8192: OK (0.15s)
119 μs ± 11 μs, 78% slower than baseline
16384: OK (0.26s)
245 μs ± 14 μs, 73% slower than baseline
32768: OK (0.16s)
541 μs ± 48 μs, 77% slower than baseline
65536: OK (0.16s)
1.18 ms ± 112 μs, 78% slower than baseline
unpack and get last element: OK (0.20s)
673 μs ± 42 μs, 74% faster than baseline
unpack and get first 120 elements: OK (0.16s)
1.88 μs ± 184 ns
Also, the benchmark results @hasufell has reported for 'unpack and look at a small number of leading elements' here and in #524 seem unbelievably large. Reading 100 bytes, allocating a hundred cons cells, and inspecting a few of them should take on the order of a microsecond, not on the order of ten milliseconds. (And it could be reasonably argued that even starting with 100 bytes is too much.) That's four orders of magnitude! I haven't tried reproducing or investigating yet, but something is very wrong somewhere.
This really terrible performance is caused by the bang on acc in Short.unpackAppendBytesStrict.go, which seems to have been mistakenly added by eaa31e02353b1a2de999d325d73070ce9bb4d001. This forces the entire result list to be built strictly from the end, rather than lazily from the start in chunks of 100 as was intended.
however... "unpack and get last element" from this PR is still magnitudes faster than the version above with unpackFoldr:
I looked into this; it stems a quirk of the current implementation of enumFromTo @Int that just happens to interact favorably with last. The base case of [x..y] corresponds to [y], so the last seen value from the previous iteration can be discarded in every branch of the fused loop. And this last seen value would otherwise be allocated as a (boxed) Word8 because it it is initially lastError.
In more typical situations, there is no advantage to peeling off the last iteration of the fused loop in this way. Indeed, it typically results mainly in the loop body being duplicated once. ~So I will raise a GHC issue and try to remove this quirky behavior upstream.~ EDIT: On second thought, this duplication seems unavoidable for the specific case of [minBound..maxBound], which is a bit unfortunate. Either way, I don't think it's worth worrying about whether this specific interaction happens in a bytestring benchmark.
Alright... so I changed the implementation to be similar to Data.ByteString, utilizing the GHC.Exts.build (unpackFoldr sbs) trick. Additionally I removed the bang pattern on the acc of unpackAppendBytesStrict.
The results look pretty good... only the first few tests seem to have regressed, but everything else is apparently faster:
Bench result
All
ShortByteString
Small payload
mempty: OK (0.46s)
22.2 ns ± 1.4 ns, 178% slower than baseline
UTF-8 String (naive): OK (0.37s)
778 ns ± 41 ns
String (naive): OK (0.37s)
777 ns ± 49 ns, 78% slower than baseline
intercalate
intercalate (large): OK (0.48s)
20.2 μs ± 1.1 μs, 137% slower than baseline
intercalate (small): OK (0.29s)
1.75 μs ± 170 ns, 93% slower than baseline
intercalate (tiny): OK (0.31s)
214 ns ± 21 ns, 96% slower than baseline
partition
strict
mostlyTrueFast: OK (0.23s)
248 μs ± 22 μs, 80% slower than baseline
mostlyFalseFast: OK (0.31s)
200 μs ± 11 μs, 67% slower than baseline
balancedFast: OK (0.29s)
382 μs ± 22 μs, 81% slower than baseline
mostlyTrueSlow: OK (0.26s)
6.57 ms ± 471 μs, 69% slower than baseline
mostlyFalseSlow: OK (0.25s)
6.19 ms ± 475 μs, 59% slower than baseline
balancedSlow: OK (0.26s)
6.44 ms ± 387 μs
folds
strict
foldl
1: OK (0.35s)
29.1 ns ± 2.8 ns, 55% faster than baseline
2: OK (0.34s)
31.4 ns ± 2.6 ns, 61% faster than baseline
4: OK (0.36s)
37.3 ns ± 2.6 ns, 73% faster than baseline
8: OK (0.40s)
44.8 ns ± 2.6 ns, 81% faster than baseline
16: OK (0.34s)
66.3 ns ± 5.2 ns, 86% faster than baseline
32: OK (0.37s)
149 ns ± 11 ns, 84% faster than baseline
64: OK (0.32s)
235 ns ± 21 ns, 87% faster than baseline
128: OK (0.41s)
399 ns ± 21 ns, 89% faster than baseline
256: OK (0.39s)
750 ns ± 41 ns, 89% faster than baseline
512: OK (0.37s)
1.45 μs ± 84 ns, 90% faster than baseline
1024: OK (0.36s)
2.84 μs ± 171 ns, 90% faster than baseline
2048: OK (0.32s)
5.63 μs ± 333 ns, 90% faster than baseline
4096: OK (0.34s)
11.2 μs ± 668 ns, 90% faster than baseline
8192: OK (0.29s)
22.3 μs ± 1.4 μs, 91% faster than baseline
16384: OK (0.31s)
44.5 μs ± 2.7 μs, 91% faster than baseline
32768: OK (0.30s)
89.1 μs ± 5.7 μs, 91% faster than baseline
65536: OK (0.29s)
177 μs ± 11 μs, 92% faster than baseline
foldl'
1: OK (0.36s)
30.3 ns ± 2.7 ns, 53% faster than baseline
2: OK (0.35s)
32.7 ns ± 2.9 ns, 60% faster than baseline
4: OK (0.37s)
37.7 ns ± 2.8 ns, 72% faster than baseline
8: OK (0.41s)
47.8 ns ± 2.8 ns, 79% faster than baseline
16: OK (0.35s)
71.2 ns ± 5.5 ns, 85% faster than baseline
32: OK (0.34s)
119 ns ± 11 ns, 87% faster than baseline
64: OK (0.44s)
214 ns ± 11 ns, 88% faster than baseline
128: OK (0.31s)
435 ns ± 42 ns, 88% faster than baseline
256: OK (0.41s)
817 ns ± 41 ns, 88% faster than baseline
512: OK (0.39s)
1.58 μs ± 83 ns, 89% faster than baseline
1024: OK (0.38s)
3.12 μs ± 169 ns, 89% faster than baseline
2048: OK (0.34s)
6.16 μs ± 345 ns, 89% faster than baseline
4096: OK (0.99s)
12.3 μs ± 166 ns, 89% faster than baseline
8192: OK (0.31s)
24.5 μs ± 1.4 μs, 90% faster than baseline
16384: OK (0.33s)
49.1 μs ± 2.6 μs, 90% faster than baseline
32768: OK (0.32s)
98.1 μs ± 5.6 μs, 90% faster than baseline
65536: OK (0.31s)
196 μs ± 11 μs, 91% faster than baseline
foldr
1: OK (0.37s)
33.0 ns ± 2.7 ns, 43% faster than baseline
2: OK (0.40s)
38.8 ns ± 2.6 ns, 50% faster than baseline
4: OK (0.45s)
50.6 ns ± 2.8 ns, 63% faster than baseline
8: OK (0.38s)
75.6 ns ± 5.5 ns, 68% faster than baseline
16: OK (0.35s)
126 ns ± 11 ns, 74% faster than baseline
32: OK (0.34s)
264 ns ± 25 ns, 76% faster than baseline
64: OK (0.32s)
492 ns ± 46 ns, 76% faster than baseline
128: OK (0.30s)
913 ns ± 90 ns, 77% faster than baseline
256: OK (0.38s)
1.71 μs ± 83 ns, 78% faster than baseline
512: OK (0.37s)
3.33 μs ± 169 ns, 79% faster than baseline
1024: OK (0.38s)
6.57 μs ± 339 ns, 79% faster than baseline
2048: OK (0.60s)
53.9 μs ± 4.9 μs, 48% faster than baseline
4096: OK (0.62s)
115 μs ± 11 μs, 45% faster than baseline
8192: OK (1.09s)
221 μs ± 20 μs, 49% faster than baseline
16384: OK (1.09s)
443 μs ± 44 μs, 50% faster than baseline
32768: OK (1.98s)
862 μs ± 41 μs, 53% faster than baseline
65536: OK (1.09s)
1.82 ms ± 175 μs, 52% faster than baseline
foldr'
1: OK (0.47s)
25.8 ns ± 1.4 ns, 66% faster than baseline
2: OK (0.48s)
27.5 ns ± 1.3 ns, 75% faster than baseline
4: OK (0.37s)
31.9 ns ± 2.9 ns, 83% faster than baseline
8: OK (0.43s)
44.4 ns ± 2.9 ns, 89% faster than baseline
16: OK (0.37s)
69.5 ns ± 5.3 ns, 91% faster than baseline
32: OK (0.34s)
116 ns ± 11 ns, 92% faster than baseline
64: OK (0.43s)
206 ns ± 10 ns, 93% faster than baseline
128: OK (0.43s)
415 ns ± 27 ns, 92% faster than baseline
256: OK (0.40s)
780 ns ± 41 ns, 93% faster than baseline
512: OK (0.38s)
1.51 μs ± 86 ns, 93% faster than baseline
1024: OK (0.35s)
2.97 μs ± 180 ns, 93% faster than baseline
2048: OK (0.36s)
5.90 μs ± 333 ns, 93% faster than baseline
4096: OK (0.35s)
11.7 μs ± 679 ns, 94% faster than baseline
8192: OK (0.29s)
23.4 μs ± 1.4 μs, 94% faster than baseline
16384: OK (0.32s)
46.8 μs ± 2.6 μs, 94% faster than baseline
32768: OK (0.31s)
93.6 μs ± 5.4 μs, 94% faster than baseline
65536: OK (0.30s)
187 μs ± 11 μs, 95% faster than baseline
foldr1'
1: OK (0.36s)
60.8 ns ± 5.4 ns, 35% faster than baseline
2: OK (0.44s)
100 ns ± 8.0 ns, 25% faster than baseline
4: OK (0.40s)
172 ns ± 11 ns, 24% faster than baseline
8: OK (0.35s)
318 ns ± 23 ns, 31% faster than baseline
16: OK (0.37s)
668 ns ± 41 ns, 25% faster than baseline
32: OK (0.33s)
1.23 μs ± 84 ns, 29% faster than baseline
64: OK (0.34s)
2.43 μs ± 180 ns, 30% faster than baseline
128: OK (0.32s)
4.79 μs ± 376 ns, 31% faster than baseline
256: OK (0.31s)
9.60 μs ± 726 ns, 31% faster than baseline
512: OK (0.30s)
19.1 μs ± 1.5 μs, 31% faster than baseline
1024: OK (0.29s)
38.1 μs ± 3.1 μs, 32% faster than baseline
2048: OK (0.28s)
76.1 μs ± 5.8 μs, 32% faster than baseline
4096: OK (0.27s)
153 μs ± 11 μs, 33% faster than baseline
8192: OK (0.26s)
304 μs ± 24 μs, 35% faster than baseline
16384: OK (0.25s)
604 μs ± 51 μs, 36% faster than baseline
32768: OK (0.24s)
1.20 ms ± 88 μs, 41% faster than baseline
65536: OK (0.22s)
2.40 ms ± 184 μs, 46% faster than baseline
unfoldrN
1: OK (0.40s)
76.4 ns ± 5.6 ns, 40% faster than baseline
2: OK (0.40s)
78.8 ns ± 5.7 ns, 53% faster than baseline
4: OK (0.41s)
83.3 ns ± 5.5 ns, 65% faster than baseline
8: OK (0.43s)
93.4 ns ± 5.7 ns, 80% faster than baseline
16: OK (0.34s)
114 ns ± 11 ns, 85% faster than baseline
32: OK (0.39s)
154 ns ± 11 ns, 89% faster than baseline
64: OK (0.35s)
259 ns ± 21 ns, 90% faster than baseline
128: OK (0.31s)
422 ns ± 41 ns, 92% faster than baseline
256: OK (0.38s)
747 ns ± 41 ns, 93% faster than baseline
512: OK (0.35s)
1.40 μs ± 82 ns, 93% faster than baseline
1024: OK (0.36s)
2.71 μs ± 169 ns, 93% faster than baseline
2048: OK (0.35s)
5.29 μs ± 328 ns, 93% faster than baseline
4096: OK (3.41s)
11.2 μs ± 885 ns, 93% faster than baseline
8192: OK (3.23s)
21.9 μs ± 953 ns, 93% faster than baseline
16384: OK (1.70s)
44.3 μs ± 3.7 μs, 93% faster than baseline
32768: OK (1.67s)
87.8 μs ± 7.2 μs, 93% faster than baseline
65536: OK (1.66s)
174 μs ± 13 μs, 93% faster than baseline
filter
1: OK (0.41s)
79.6 ns ± 5.4 ns, 27% faster than baseline
2: OK (0.41s)
83.2 ns ± 5.1 ns, 43% faster than baseline
4: OK (0.43s)
89.0 ns ± 5.2 ns, 57% faster than baseline
8: OK (0.45s)
97.7 ns ± 5.9 ns, 71% faster than baseline
16: OK (0.56s)
148 ns ± 8.3 ns, 77% faster than baseline
32: OK (0.38s)
148 ns ± 11 ns, 87% faster than baseline
64: OK (0.33s)
259 ns ± 22 ns, 88% faster than baseline
128: OK (0.43s)
395 ns ± 21 ns, 91% faster than baseline
256: OK (0.36s)
690 ns ± 42 ns, 92% faster than baseline
512: OK (0.34s)
1.27 μs ± 89 ns, 92% faster than baseline
1024: OK (0.35s)
2.46 μs ± 171 ns, 92% faster than baseline
2048: OK (0.33s)
4.82 μs ± 328 ns, 92% faster than baseline
4096: OK (1.67s)
10.5 μs ± 961 ns, 92% faster than baseline
8192: OK (1.69s)
21.0 μs ± 1.6 μs, 92% faster than baseline
16384: OK (0.53s)
43.5 μs ± 2.1 μs, 92% faster than baseline
32768: OK (0.86s)
82.7 μs ± 4.1 μs, 92% faster than baseline
65536: OK (1.46s)
154 μs ± 2.7 μs, 93% faster than baseline
findIndexOrLength
takeWhile: OK (0.36s)
25.4 μs ± 1.4 μs, 59% slower than baseline
dropWhile: OK (0.28s)
16.0 μs ± 1.3 μs
break: OK (0.36s)
25.5 μs ± 1.3 μs
findIndex_
findIndices: OK (0.26s)
30.3 μs ± 2.6 μs, 28% faster than baseline
find: OK (0.33s)
468 ns ± 43 ns, 91% faster than baseline
traversals
map (+1) large: OK (0.29s)
3.44 ms ± 179 μs
map (+1) small: OK (0.42s)
754 ns ± 42 ns
ShortByteString strict first index
FindIndices: OK (0.34s)
483 ns ± 41 ns
ElemIndices: OK (0.33s)
482 ns ± 42 ns
FindIndex: OK (0.36s)
578 ns ± 42 ns, 47% slower than baseline
ElemIndex: OK (0.38s)
67.0 ns ± 5.1 ns
ShortByteString strict second index
FindIndices: OK (0.35s)
1.21 μs ± 86 ns
ElemIndices: OK (0.35s)
1.21 μs ± 82 ns
FindIndex: OK (1.90s)
170 μs ± 15 μs
ElemIndex: OK (1.88s)
168 μs ± 16 μs
ShortByteString index equality inlining
FindIndices/inlined: OK (0.24s)
2.55 ms ± 179 μs
FindIndices/non-inlined: OK (0.30s)
16.5 ms ± 774 μs
FindIndex/inlined: OK (0.40s)
387 ns ± 21 ns
FindIndex/non-inlined: OK (0.29s)
3.65 μs ± 340 ns
ShortByteString conversions
unpack
1: OK (0.37s)
62.0 ns ± 5.7 ns
2: OK (0.40s)
73.6 ns ± 5.5 ns
4: OK (0.43s)
100 ns ± 5.8 ns, 19% faster than baseline
8: OK (0.40s)
160 ns ± 12 ns, 25% faster than baseline
16: OK (0.36s)
307 ns ± 23 ns, 27% faster than baseline
32: OK (0.37s)
617 ns ± 41 ns, 30% faster than baseline
64: OK (0.38s)
1.31 μs ± 93 ns, 30% faster than baseline
128: OK (0.36s)
2.62 μs ± 235 ns, 27% faster than baseline
256: OK (0.35s)
5.25 μs ± 339 ns, 26% faster than baseline
512: OK (0.32s)
10.4 μs ± 702 ns, 26% faster than baseline
1024: OK (0.31s)
20.8 μs ± 1.3 μs, 27% faster than baseline
2048: OK (0.30s)
41.7 μs ± 2.7 μs, 29% faster than baseline
4096: OK (0.30s)
84.1 μs ± 5.3 μs, 30% faster than baseline
8192: OK (0.29s)
171 μs ± 12 μs, 30% faster than baseline
16384: OK (0.28s)
345 μs ± 24 μs, 33% faster than baseline
32768: OK (0.47s)
711 μs ± 31 μs, 34% faster than baseline
65536: OK (0.46s)
1.50 ms ± 75 μs, 34% faster than baseline
pack
1: OK (0.37s)
69.2 ns ± 6.4 ns
2: OK (0.41s)
79.9 ns ± 5.2 ns
4: OK (0.46s)
103 ns ± 6.0 ns
8: OK (0.40s)
158 ns ± 11 ns
16: OK (0.37s)
280 ns ± 21 ns
32: OK (0.35s)
545 ns ± 42 ns
64: OK (0.35s)
1.16 μs ± 86 ns
128: OK (0.32s)
2.32 μs ± 169 ns
256: OK (0.33s)
4.60 μs ± 330 ns
512: OK (0.29s)
9.15 μs ± 688 ns
1024: OK (0.32s)
20.7 μs ± 1.3 μs
2048: OK (0.33s)
42.5 μs ± 3.5 μs
4096: OK (0.31s)
86.3 μs ± 6.1 μs
8192: OK (0.32s)
185 μs ± 11 μs
16384: OK (0.32s)
379 μs ± 34 μs
32768: OK (0.18s)
750 μs ± 68 μs
65536: OK (0.28s)
1.50 ms ± 135 μs
unpack and get last element: OK (0.27s)
6.68 ms ± 400 μs, 93% faster than baseline
unpack and get first 120 elements: OK (0.36s)
5.41 μs ± 376 ns, 99% faster than baseline
All 186 tests passed (88.73s)
However... it's still (a little) slower for the last 2 benchmarks than my original unpack that uses
unpack sbs = let ix = length sbs - 1
in List.map (unsafeIndex sbs) [0..ix]
{-# INLINE unpack #-}
which runs
unpack and get last element: OK (0.36s)
2.58 ms ± 121 μs
unpack and get first 120 elements: OK (0.27s)
1.74 μs ± 95 ns
vs the latest implementation:
unpack and get last element: OK (0.27s)
6.68 ms ± 400 μs
unpack and get first 120 elements: OK (0.36s)
5.41 μs ± 376 ns
But that's quite an improvement.
Drone CI seems broken