zig icon indicating copy to clipboard operation
zig copied to clipboard

Blake3 implementation in std is slow

Open matklad opened this issue 2 years ago • 6 comments

Zig Version

0.10.1

Steps to Reproduce and Observed Behavior

See https://github.com/matklad/benchmarks/tree/1caed4cfdd2285f2f1946f592fc1d492fd9ed836/blake3 for a reproducible Rust vs Zig benchmark. The representative Result is

Rust
time  = 1.62s
MiB/s = 3859

Zig + ReleaseFast
time  = 6.074s
MiB/S = 1028

(this is pure Rust impl, assembly/C impls is a little bit further faster than that)

Zig results are approximately the same between 0.10.1 and 0.11.0-dev.2680+a1aa55ebe

This is relatively important for us at TigerBeetle.

Expected Behavior

Be as fast as Rust. Ideally, be as fast as asm.

matklad avatar Apr 20 '23 16:04 matklad

The current BLAKE3 implementation is just a direct port of the reference implementation. It's definitely slow.

As an alternative, Aegis MAC that we already have in stdlib is super fast on any platform with AES acceleration.

Also, KangarooTwelve is being standardized, is as fast as BLAKE3 or even faster on some platforms, and is very likely to be chosen over BLAKE3 in modern protocols. I expect it to be in the stdlib soon, with great performance.

jedisct1 avatar Apr 20 '23 18:04 jedisct1

I had compared Blake3 to Blake2b512 and Sha256a a while back with the following results:

// Blake3      : 1
// Blake2b512  : 1.3 x slower than Blake3
// Sha256      : 4 x slower than Blake3

Good to know we can improve to go about 4x faster than that even!

marler8997 avatar Apr 20 '23 19:04 marler8997

Currently:

Apple M1:

BLAKE3 AEGIS-128L MAC AEGIS-128X MAC
1.56 GB/s 15.6 GB/s 14.5 GB/s

AMD EPYC (Zen 2)

BLAKE3 AEGIS-128L MAC AEGIS-128X MAC
4.90 GB/s 20.7 GB/s 31.9 GB/s
  • BLAKE3: Rust crate, with assembly
  • AEGIS-128L from the Zig standard library (std.crypto.auth.aegis.Aegis128LMac)
  • AEGIS-128X (Aegis128XMac)

More recent benchmarks:

image

jedisct1 avatar Apr 20 '23 20:04 jedisct1

The rust implementation seems to more or less just import the c implementation, you could do that too in zig.

recursivetree avatar May 01 '23 14:05 recursivetree

I wrote an Blake3 implementation is Zig. Here are the benchmark results (hyperfine):

Start file:

const std = @import("std");
const config = @import("config");
const blake3 = @import("./root.zig");
const c = @cImport({
    @cInclude("blake3.h");
});

const stderr = std.io.getStdErr();
const stdout = std.io.getStdOut();

fn run() !void {
    if (std.os.argv.len != 2) {
        return error.OneArgRequired;
    }

    const fd = try std.posix.openZ(std.os.argv[1], .{}, undefined);
    defer std.posix.close(fd);

    const stat = try std.posix.fstat(fd);

    const area = try std.posix.mmap(null, @intCast(stat.size), 1, .{ .TYPE = .PRIVATE }, fd, 0);
    defer std.posix.munmap(area);

    var out: [32]u8 = undefined;

    if (config.c) {
        var hasher: c.blake3_hasher = undefined;
        c.blake3_hasher_init(&hasher);
        c.blake3_hasher_update(&hasher, area.ptr, area.len);
        c.blake3_hasher_finalize(&hasher, &out, out.len);
    } else if (config.std) {
        std.crypto.hash.Blake3.hash(area, &out, .{});
    } else {
        blake3.Blake3(.{}).hash(area, &out);
    }

    try stdout.writer().print("{s}\n", .{std.fmt.bytesToHex(out, .lower)});
}

pub fn main() u8 {
    run() catch |err| {
        stderr.writer().print("Error: {}\n", .{err}) catch unreachable;
        return 1;
    };
    return 0;
}

c-blake3 uses config.c = true and config.std = false and compiles blake3.c, blake3_dispatch.c, blake3_portable.c, blake3_sse2.c, blake3_sse41.c and blake3_avx2.c.

c-asm-blake3 uses config.c = true and config.std = false and compiles blake3.c, blake3_dispatch.c, blake3_portable.c, blake3_sse2_x86-64_unix.S, blake3_sse41_x86-64_unix.S and blake3_avx2_x86-64_unix.S.

zig-std-blake3 uses config.c = false and config.std = true.

zig-blake3 uses config.c = false and config.std = false (my impl).

Benchmark 1: b3sum /home/user/Downloads/archlinux-2024.04.01-x86_64.iso --num-threads 1
  Time (mean ± σ):     238.3 ms ±   7.2 ms    [User: 213.4 ms, System: 22.6 ms]
  Range (min … max):   230.7 ms … 265.2 ms    500 runs
 
Benchmark 2: ./zig-out/bin/c-asm-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
  Time (mean ± σ):     229.7 ms ±   6.5 ms    [User: 207.3 ms, System: 22.1 ms]
  Range (min … max):   223.9 ms … 268.0 ms    500 runs
 
Benchmark 3: ./zig-out/bin/c-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
  Time (mean ± σ):     268.3 ms ±  27.2 ms    [User: 245.8 ms, System: 22.2 ms]
  Range (min … max):   246.2 ms … 362.6 ms    500 runs
 
Benchmark 4: ./zig-out/bin/zig-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
  Time (mean ± σ):     240.3 ms ±   2.6 ms    [User: 217.4 ms, System: 22.6 ms]
  Range (min … max):   238.7 ms … 287.0 ms    500 runs
 
Benchmark 5: ./zig-out/bin/zig-std-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
  Time (mean ± σ):     925.5 ms ±  13.7 ms    [User: 903.9 ms, System: 21.1 ms]
  Range (min … max):   918.1 ms … 1064.6 ms    500 runs
 
Summary
  ./zig-out/bin/c-asm-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso ran
    1.04 ± 0.04 times faster than b3sum /home/user/Downloads/archlinux-2024.04.01-x86_64.iso --num-threads 1
    1.05 ± 0.03 times faster than ./zig-out/bin/zig-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
    1.17 ± 0.12 times faster than ./zig-out/bin/c-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
    4.03 ± 0.13 times faster than ./zig-out/bin/zig-std-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso

CPU: Intel Core 12700K

hyperfine.json

I will open a PR soon.

rpkak avatar Jun 07 '24 12:06 rpkak

@rpkak How did it go? I can't find the code or a PR

JonathanHallstrom avatar Oct 10 '24 19:10 JonathanHallstrom

@rpkak How did it go? I can't find the code or a PR

The reason why I have not created a PR until now is that my current implementation produces too much binary code. Also for small inputs my implementation is sometimes a bit slower than the std implementation.

Due to limited time, I am not currently working on this, but I will do this at some point in time.

rpkak avatar Dec 01 '24 10:12 rpkak

Is the code public somewhere? I could work on it

JonathanHallstrom avatar Dec 01 '24 10:12 JonathanHallstrom

Is the code public somewhere? I could work on it

Now the code is public at rpkak/zig-blake3.

rpkak avatar Dec 22 '24 13:12 rpkak

Addressed by #25574 and #25587

We now also have #25593 as an alternative.

jedisct1 avatar Nov 01 '25 06:11 jedisct1