sha2: explore addition of SSE and AVX2 backends for SHA-256
Currently we only have software and SHA-NI backends for SHA-256.
Intel has a paper describing various SIMD implementations of SHA-256 here:
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/sha-256-implementations-paper.pdf
Notably they describe these two variants:
- sha256_avx2_rorx2
- sha256_avx2_rorx8
The Intel® 64 processors built on 22nm process technology introduce rorx, which is an instruction set enhancement that allows nondestructive fast rotates by a constant, and 256-bit SIMD-integer instructions with Intel® AVX2.
Two versions use the rorx instruction:
sha256_avx2_rorx2 and sha256_avx2_rorx8. The former is optimized for smaller buffers, a smaller memory footprint, and single-thread execution. The latter is optimized for larger buffers, particularly when Intel® Hyper-Threading Technology (Intel® HT Technology) is enabled, but it has a larger memory footprint, which might result in worse data-cache behavior.
I'm not sure if there are newer/better methods available for implementing SHA-256 with AVX2, but that's the best I was able to find.
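To make the role of rorx concrete, here is a minimal sketch of the four SHA-256 sigma functions as defined in FIPS 180-4. Each one combines rotates (and, for the small sigmas, a shift) of the same input by fixed constants, so a non-destructive rotate saves the register copies a destructive `ror` would need. The function names are my own, and whether the compiler actually emits rorx depends on the target features it is given (rorx belongs to BMI2); this is not the code from the Intel paper.

```rust
/// Big sigma 0 from FIPS 180-4: three rotates of the same input.
#[inline]
pub fn big_sigma0(x: u32) -> u32 {
    x.rotate_right(2) ^ x.rotate_right(13) ^ x.rotate_right(22)
}

/// Big sigma 1 from FIPS 180-4.
#[inline]
pub fn big_sigma1(x: u32) -> u32 {
    x.rotate_right(6) ^ x.rotate_right(11) ^ x.rotate_right(25)
}

/// Small sigma 0 (message schedule): two rotates and one logical shift.
#[inline]
pub fn small_sigma0(x: u32) -> u32 {
    x.rotate_right(7) ^ x.rotate_right(18) ^ (x >> 3)
}

/// Small sigma 1 (message schedule).
#[inline]
pub fn small_sigma1(x: u32) -> u32 {
    x.rotate_right(17) ^ x.rotate_right(19) ^ (x >> 10)
}
```

Every call reads `x` two or three times, which is exactly the pattern where a rotate that writes to a separate destination register helps.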
Some code can be found here:
https://web.archive.org/web/20191119125017/https://downloadmirror.intel.com/22357/eng/sha256_code_release_v2.zip
or here:
https://github.com/intel/isa-l_crypto/tree/master/sha256_mb
(links taken from https://stackoverflow.com/questions/18546244/sha256-performance-optimization-in-c)
I see that Python 3's implementation of SHA-256 is 2x faster than the Rust one from the sha2 crate. The consequences are big enough that this makes a strong argument for not using Rust at all (see reasons below).
Tests were done on Debian Linux sid in Docker on a Librem Purism, target x86_64-unknown-linux-gnu, rustc 1.72.0-nightly. Here is sh.py3:
#!/usr/bin/env python3
import io, hashlib
with open("/home/user/dedup-bench/sto/00", "rb") as f:
    digest = hashlib.file_digest(f, "sha256")
print(digest.hexdigest())
/home/user/dedup-bench/sto/00 is a raw disk image. Its size is 2147483648 bytes.
Here is Cargo.toml:
[package]
name = "s"
version = "0.1.0"
edition = "2021"
[dependencies]
hex = "0.4.3"
#sha2 = { version = "0.10.7", features = ["asm"] }
sha2 = { version = "0.10.7", features = [] }
main.rs:
fn main() {
    use sha2::Digest;
    use std::io::Read;
    let mut hasher = sha2::Sha256::new();
    let mut buf = vec![];
    std::fs::File::open("/home/user/dedup-bench/sto/00").unwrap().read_to_end(&mut buf).unwrap();
    hasher.update(&buf);
    println!("{}", hex::encode(hasher.finalize()));
}
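As an aside on the program above: it reads the whole 2 GiB image into a `Vec` before hashing, while the Python script streams the file. A streaming variant can be sketched with only the standard library by copying from any reader into any `std::io::Write` sink. The helper name `stream_into` and the 1 MiB buffer size are my own choices. I am assuming that `sha2::Sha256` implements `std::io::Write` when the crate's `std` feature is enabled, so the hasher could be passed as the sink.

```rust
use std::io::{self, BufReader, Read, Write};

/// Copies everything from `reader` into `sink` through a 1 MiB buffer,
/// so the input never has to fit in memory at once.
/// Returns the number of bytes copied.
pub fn stream_into<R: Read, W: Write>(reader: R, sink: &mut W) -> io::Result<u64> {
    // Hypothetical buffer size; tune for the workload.
    let mut reader = BufReader::with_capacity(1 << 20, reader);
    io::copy(&mut reader, sink)
}
```

Under that assumption, usage would look like `stream_into(std::fs::File::open(path)?, &mut hasher)?` followed by `hasher.finalize()`. This only changes memory use and I/O pattern; it does not make the compression function itself any faster.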
Here are results:
<[sid]>root@377503f894a2:~/s# time -p python3 ~/sh.py3
97856e0ae559f80dc3cde36cc23a3e37ae3ce2b92eed22ae5d7ec55e36c4401d
real 5.83
user 5.41
sys 0.41
<[sid]>root@377503f894a2:~/s# time -p python3 ~/sh.py3
97856e0ae559f80dc3cde36cc23a3e37ae3ce2b92eed22ae5d7ec55e36c4401d
real 5.79
user 5.44
sys 0.35
<[sid]>root@377503f894a2:~/s# time -p cargo run --release
Finished release [optimized] target(s) in 0.01s
Running `target/release/s`
97856e0ae559f80dc3cde36cc23a3e37ae3ce2b92eed22ae5d7ec55e36c4401d
real 13.28
user 12.24
sys 1.03
<[sid]>root@377503f894a2:~/s# time -p cargo run --release
Finished release [optimized] target(s) in 0.01s
Running `target/release/s`
97856e0ae559f80dc3cde36cc23a3e37ae3ce2b92eed22ae5d7ec55e36c4401d
real 13.25
user 12.22
sys 1.03
Results with asm are slightly better, but still way worse than Python's:
<[sid]>root@377503f894a2:~/s# time -p cargo run --release
Compiling s v0.1.0 (/root/s)
Finished release [optimized] target(s) in 0.27s
Running `target/release/s`
97856e0ae559f80dc3cde36cc23a3e37ae3ce2b92eed22ae5d7ec55e36c4401d
real 11.33
user 10.29
sys 1.10
<[sid]>root@377503f894a2:~/s# time -p cargo run --release
Finished release [optimized] target(s) in 0.01s
Running `target/release/s`
97856e0ae559f80dc3cde36cc23a3e37ae3ce2b92eed22ae5d7ec55e36c4401d
real 11.06
user 10.03
sys 1.03
Here is /proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
stepping : 4
microcode : 0x2f
cpu MHz : 3061.085
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap intel_pt xsaveopt dtherm arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds mmio_unknown
bogomips : 6185.31
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
stepping : 4
microcode : 0x2f
cpu MHz : 3092.733
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap intel_pt xsaveopt dtherm arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds mmio_unknown
bogomips : 6185.31
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
Both the cpufeatures crate and is_x86_feature_detected!("sha") report that my CPU does not have the sha extension.
So the SHA extension is not supported, and Python beats Rust here. I think this report belongs in this issue.
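For reference, the kind of runtime dispatch this issue is asking for can be sketched with the standard `is_x86_feature_detected!` macro. The `Backend` enum and `pick_backend` function are hypothetical names for illustration, and the `Avx2Rorx` variant stands for a backend that does not exist in sha2 today. On the CPU above (flags include avx2 and bmi2 but not sha_ni), this sketch would select the AVX2 path.

```rust
/// Hypothetical set of SHA-256 backends, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    ShaNi,
    Avx2Rorx,
    Software,
}

/// Picks the best backend the running CPU supports.
pub fn pick_backend() -> Backend {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // SHA-NI code typically also relies on SSE4.1 instructions.
        if is_x86_feature_detected!("sha") && is_x86_feature_detected!("sse4.1") {
            return Backend::ShaNi;
        }
        // rorx is part of BMI2, so an AVX2 + rorx backend needs both flags.
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("bmi2") {
            return Backend::Avx2Rorx;
        }
    }
    // Portable fallback for other CPUs and architectures.
    Backend::Software
}
```

The detection result is cached by the standard library after the first query, so calling `pick_backend()` per hash is cheap, though a real implementation would more likely resolve it once.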
I discovered this problem when I compared borg performance with my own simple Rust program: https://github.com/borgbackup/borg/issues/7674 . Hash computation is the main reason my Rust implementation is slow. It is impossible for Rust to beat borg's speed in sha256 mode. So if for some reason I needed to create a borg replacement with sha256, it would simply be impossible (with the current state of the ecosystem) to create a proper alternative in Rust. So currently, bad sha2 performance is a valid reason not to choose Rust at all. So, please, raise the importance of this issue.
Also please note: I personally don't need fast sha256 in Rust. I will simply choose blake3 for my application. But I still think that this problem is important for others.
There was some discussion on #490 about improving AVX2 performance /cc @codahale
I initially thought the discrepancy was partly due to memory allocation, but no... python3 (actually OpenSSL, because that's what is used underneath) is 2x faster.
Here I try to measure only the actual "sha256" compute part, not the read/memory allocate part:
/tmp/random contains 2 GB of random data
dd if=/dev/urandom of=/tmp/random bs=1M count=2048
import hashlib
import time
import sys
data = open(sys.argv[1], 'rb').read()
s = hashlib.sha256()
begin = time.time()
s.update(data)
end = time.time()
print(f"{s.hexdigest()} - took {end - begin}")
results:
$ python3 /tmp/bench.py /tmp/random
fa5e60cb2e35b5f84690451972de7e895307ee1fd294d2a365466be93b1ccdc2 - took 3.6656391620635986
$ python3 /tmp/bench.py /tmp/random
fa5e60cb2e35b5f84690451972de7e895307ee1fd294d2a365466be93b1ccdc2 - took 3.675901174545288
$ python3 /tmp/bench.py /tmp/random
fa5e60cb2e35b5f84690451972de7e895307ee1fd294d2a365466be93b1ccdc2 - took 3.660248279571533
rust counterpart:
use std::time::Instant;
fn main() {
    use sha2::Digest;
    use std::io::Read;
    let mut hasher = sha2::Sha256::new();
    let mut buf = vec![];
    // Read the whole file first so that only hashing is timed.
    std::fs::File::open("/tmp/random").unwrap().read_to_end(&mut buf).unwrap();
    let begin = Instant::now();
    hasher.update(&buf);
    let end = Instant::now();
    let duration = end - begin;
    println!("{} - {:?}", hex::encode(hasher.finalize()), duration);
}
no feature:
$ cargo run --release
fa5e60cb2e35b5f84690451972de7e895307ee1fd294d2a365466be93b1ccdc2 - 7.751759982s
$ cargo run --release
fa5e60cb2e35b5f84690451972de7e895307ee1fd294d2a365466be93b1ccdc2 - 7.744572511s
"asm" feature
$ cargo run --release
fa5e60cb2e35b5f84690451972de7e895307ee1fd294d2a365466be93b1ccdc2 - 6.803209144s
$ cargo run --release
fa5e60cb2e35b5f84690451972de7e895307ee1fd294d2a365466be93b1ccdc2 - 6.81487079s
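To compare these timings on a common scale, they can be converted to throughput. The helper below is a small sketch (the name `mib_per_sec` is my own); the input size is the 2048 MiB written by the dd command above.

```rust
/// Converts a byte count and an elapsed time in seconds into MiB/s.
pub fn mib_per_sec(bytes: u64, secs: f64) -> f64 {
    (bytes as f64) / (1024.0 * 1024.0) / secs
}
```

Applied to the numbers above, this gives roughly 559 MiB/s for Python (OpenSSL), 264 MiB/s for sha2 without features, and 301 MiB/s for sha2 with the "asm" feature.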
I just checked the "openssl" crate. It (predictably) has nearly the same speed as Python. So my claim that "bad sha2 performance is a valid reason not to choose Rust at all" was too bold. But, of course, having a fast Rust-native sha256 is a good thing.
We would like to eventually integrate OpenSSL's assembly (see https://github.com/RustCrypto/asm-hashes/issues/5), but it requires a fair amount of work since we do not want to rely on Perl and external compilers.
@safinaskar @mat-gas I was looking into how Rust performs for sha256 and, out of curiosity, ran your tests (I have an AMD processor though). I am getting similar results for both. I used a Kubuntu ISO for the test.
python3.11 ./sh.py3
0356bd2d13d7d8d4fc26b16f676c04a396e29c48996736ac88623edfa9dbeb75 - took 1.2526683807373047
Without ASM:
cargo run --release
0356bd2d13d7d8d4fc26b16f676c04a396e29c48996736ac88623edfa9dbeb75 - took 1.242192767s
With ASM:
cargo run --release
0356bd2d13d7d8d4fc26b16f676c04a396e29c48996736ac88623edfa9dbeb75 - took 1.241353036s
By the way, you don't need a separate dependency for hex; you can write:
println!("{:02x} - took {:?}", hasher.finalize(), duration);
(I have an AMD processor though)
You are likely getting results for the SHA-NI backend. Enabling the asm feature has no effect on it. And it's unsurprising that OpenSSL and sha2 take approximately the same time: under the hood, both use effectively the same sequence of instructions for block processing on CPUs with SHA-NI available.
By the way, you don't need a separate dependency for hex
Note that we eventually plan to migrate to const generics, and such code will stop working.
Ah, I see. I thought that it was part of SSE and didn't realize there was a new SHA-NI extension added to processors a few years back. I tried it on an older computer and, as said, the results were 2x slower :(
FWIW we've been added to SUPERCOP, you can see the results across several CPUs here:
https://bench.cr.yp.to/impl-hash/sha256.html
We're rust_sha2, whereas rust_crypto is the old rust-crypto crate.