Blake3 implementation in std is slow
Zig Version
0.10.1
Steps to Reproduce and Observed Behavior
See https://github.com/matklad/benchmarks/tree/1caed4cfdd2285f2f1946f592fc1d492fd9ed836/blake3 for a reproducible Rust vs Zig benchmark. The representative Result is
Rust
time = 1.62s
MiB/s = 3859
Zig + ReleaseFast
time = 6.074s
MiB/S = 1028
(this is pure Rust impl, assembly/C impls is a little bit further faster than that)
Zig results are approximately the same between 0.10.1 and 0.11.0-dev.2680+a1aa55ebe
This is relatively important for us at TigerBeetle.
Expected Behavior
Be as fast as Rust. Ideally, be as fast as asm.
The current BLAKE3 implementation is just a direct port of the reference implementation. It's definitely slow.
As an alternative, Aegis MAC that we already have in stdlib is super fast on any platform with AES acceleration.
Also, KangarooTwelve is being standardized, is as fast as BLAKE3 or even faster on some platforms, and is very likely to be chosen over BLAKE3 in modern protocols. I expect it to be in the stdlib soon, with great performance.
I had compared Blake3 to Blake2b512 and Sha256a a while back with the following results:
// Blake3 : 1
// Blake2b512 : 1.3 x slower than Blake3
// Sha256 : 4 x slower than Blake3
Good to know we can improve to go about 4x faster than that even!
Currently:
Apple M1:
| BLAKE3 | AEGIS-128L MAC | AEGIS-128X MAC |
|---|---|---|
| 1.56 GB/s | 15.6 GB/s | 14.5 GB/s |
AMD EPYC (Zen 2)
| BLAKE3 | AEGIS-128L MAC | AEGIS-128X MAC |
|---|---|---|
| 4.90 GB/s | 20.7 GB/s | 31.9 GB/s |
- BLAKE3: Rust crate, with assembly
- AEGIS-128L from the Zig standard library (
std.crypto.auth.aegis.Aegis128LMac) -
AEGIS-128X (
Aegis128XMac)
More recent benchmarks:
The rust implementation seems to more or less just import the c implementation, you could do that too in zig.
I wrote an Blake3 implementation is Zig. Here are the benchmark results (hyperfine):
Start file:
const std = @import("std");
const config = @import("config");
const blake3 = @import("./root.zig");
const c = @cImport({
@cInclude("blake3.h");
});
const stderr = std.io.getStdErr();
const stdout = std.io.getStdOut();
fn run() !void {
if (std.os.argv.len != 2) {
return error.OneArgRequired;
}
const fd = try std.posix.openZ(std.os.argv[1], .{}, undefined);
defer std.posix.close(fd);
const stat = try std.posix.fstat(fd);
const area = try std.posix.mmap(null, @intCast(stat.size), 1, .{ .TYPE = .PRIVATE }, fd, 0);
defer std.posix.munmap(area);
var out: [32]u8 = undefined;
if (config.c) {
var hasher: c.blake3_hasher = undefined;
c.blake3_hasher_init(&hasher);
c.blake3_hasher_update(&hasher, area.ptr, area.len);
c.blake3_hasher_finalize(&hasher, &out, out.len);
} else if (config.std) {
std.crypto.hash.Blake3.hash(area, &out, .{});
} else {
blake3.Blake3(.{}).hash(area, &out);
}
try stdout.writer().print("{s}\n", .{std.fmt.bytesToHex(out, .lower)});
}
pub fn main() u8 {
run() catch |err| {
stderr.writer().print("Error: {}\n", .{err}) catch unreachable;
return 1;
};
return 0;
}
c-blake3 uses config.c = true and config.std = false and compiles blake3.c, blake3_dispatch.c, blake3_portable.c, blake3_sse2.c, blake3_sse41.c and blake3_avx2.c.
c-asm-blake3 uses config.c = true and config.std = false and compiles blake3.c, blake3_dispatch.c, blake3_portable.c, blake3_sse2_x86-64_unix.S, blake3_sse41_x86-64_unix.S and blake3_avx2_x86-64_unix.S.
zig-std-blake3 uses config.c = false and config.std = true.
zig-blake3 uses config.c = false and config.std = false (my impl).
Benchmark 1: b3sum /home/user/Downloads/archlinux-2024.04.01-x86_64.iso --num-threads 1
Time (mean ± σ): 238.3 ms ± 7.2 ms [User: 213.4 ms, System: 22.6 ms]
Range (min … max): 230.7 ms … 265.2 ms 500 runs
Benchmark 2: ./zig-out/bin/c-asm-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
Time (mean ± σ): 229.7 ms ± 6.5 ms [User: 207.3 ms, System: 22.1 ms]
Range (min … max): 223.9 ms … 268.0 ms 500 runs
Benchmark 3: ./zig-out/bin/c-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
Time (mean ± σ): 268.3 ms ± 27.2 ms [User: 245.8 ms, System: 22.2 ms]
Range (min … max): 246.2 ms … 362.6 ms 500 runs
Benchmark 4: ./zig-out/bin/zig-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
Time (mean ± σ): 240.3 ms ± 2.6 ms [User: 217.4 ms, System: 22.6 ms]
Range (min … max): 238.7 ms … 287.0 ms 500 runs
Benchmark 5: ./zig-out/bin/zig-std-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
Time (mean ± σ): 925.5 ms ± 13.7 ms [User: 903.9 ms, System: 21.1 ms]
Range (min … max): 918.1 ms … 1064.6 ms 500 runs
Summary
./zig-out/bin/c-asm-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso ran
1.04 ± 0.04 times faster than b3sum /home/user/Downloads/archlinux-2024.04.01-x86_64.iso --num-threads 1
1.05 ± 0.03 times faster than ./zig-out/bin/zig-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
1.17 ± 0.12 times faster than ./zig-out/bin/c-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
4.03 ± 0.13 times faster than ./zig-out/bin/zig-std-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
CPU: Intel Core 12700K
I will open a PR soon.
@rpkak How did it go? I can't find the code or a PR
@rpkak How did it go? I can't find the code or a PR
The reason why I have not created a PR until now is that my current implementation produces too much binary code. Also for small inputs my implementation is sometimes a bit slower than the std implementation.
Due to limited time, I am not currently working on this, but I will do this at some point in time.
Is the code public somewhere? I could work on it
Addressed by #25574 and #25587
We now also have #25593 as an alternative.