ACP: efficient runtime checking of multiple target features

Open folkertdev opened this issue 9 months ago • 14 comments

Proposal

Problem statement

Currently, checking whether two target features are enabled is inefficient. In zlib-rs, checking for one additional target feature causes a 3% slowdown in one test case.

Performing a runtime check for 2 target features requires roughly double the number of instructions versus checking for just one feature.

Motivating examples or use cases

In zlib-rs, we want to check for both the avx2 and bmi2 features, but that check is slower than just checking for avx2.

Looking at just the happy path (where the features are already cached and both are available):

https://godbolt.org/z/f935sP6dr

// using `pclmulqdq` here because avx2 and bmi2 use the same integer constant

pub fn foo() -> bool { 
    std::is_x86_feature_detected!("pclmulqdq")
}

pub fn bar() -> bool { 
    std::is_x86_feature_detected!("avx2") && std::is_x86_feature_detected!("pclmulqdq")
}

example::foo::h4a487a8c8dbb996a:
        mov     rax, qword ptr [rip + std_detect::detect::cache::CACHE::h6b648acf387db542@GOTPCREL]
        mov     rax, qword ptr [rax]
        test    rax, rax
        je      .LBB0_1
        and     eax, 2
        xor     ecx, ecx
        or      rax, rcx
        setne   al
        ret

example::bar::h1992ebebbee721d0:
        push    rbx
        mov     rbx, qword ptr [rip + std_detect::detect::cache::CACHE::h6b648acf387db542@GOTPCREL]
        mov     rax, qword ptr [rbx]
        test    rax, rax
        je      .LBB1_1
        and     eax, 32768
        xor     ecx, ecx
        or      rax, rcx
        je      .LBB1_3
.LBB1_4:
        mov     rax, qword ptr [rbx]
        test    rax, rax
        je      .LBB1_5
.LBB1_6:
        and     eax, 2
        xor     ecx, ecx
        or      rax, rcx
        setne   al
        pop     rbx
        ret

So checking for 2 features roughly doubles the number of instructions, and performs 2 (atomic) loads.

This all makes sense, given that the cache is stored in an atomic, so the read value cannot be reused, and the expansion looks like this:

pub fn bar() -> bool {
    (false || ::std_detect::detect::__is_feature_detected::avx2()) &&
        (false || ::std_detect::detect::__is_feature_detected::pclmulqdq())
}

Solution sketch

I'd like the macro to expand to something like this instead, where __is_feature_detected() returns a bitmap of enabled features:

pub fn bar() -> bool {
    false || {
        let mask = ::std_detect::detect::AVX2 | ::std_detect::detect::PCLMULQDQ;
        ::std_detect::detect::__is_feature_detected() & mask == mask
    }
}
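
To make the intended expansion concrete, here is a minimal sketch of the single-load check. It assumes a hypothetical `CACHE` static and hypothetical `AVX2`/`PCLMULQDQ` constants. The bit positions mirror the masks in the assembly above (2 and 32768), but the real std_detect layout is an internal detail, and `set_detected_features` only stands in for the one-time CPUID initialization.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical bit assignments, taken from the masks in the assembly above.
pub const PCLMULQDQ: u64 = 1 << 1;
pub const AVX2: u64 = 1 << 15;

// Hypothetical stand-in for the std_detect cache (assumed already initialized).
static CACHE: AtomicU64 = AtomicU64::new(0);

/// Returns the cached feature bitmap with a single relaxed load.
#[inline]
pub fn detected_features() -> u64 {
    CACHE.load(Ordering::Relaxed)
}

/// True only if every bit in `mask` is set in the cached bitmap.
#[inline]
pub fn all_detected(mask: u64) -> bool {
    (detected_features() & mask) == mask
}

/// Stand-in for the one-time initialization; hypothetical helper.
pub fn set_detected_features(bits: u64) {
    CACHE.store(bits, Ordering::Relaxed);
}
```

With a constant mask, the happy path should reduce to one load, one `and`, and one compare, regardless of how many features are in the mask.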

For that to work, a single call to a is_*_feature_detected macro must be able to accept multiple target features. I can see two ways to do that:

  1. is_x86_feature_detected!("avx2", "bmi2")
  2. is_x86_feature_detected!("avx2,bmi2")

Option 2 has precedent in e.g. #[target_feature(enable = "avx2,bmi2")], but option 1 can (I believe) be implemented with macro_rules! and also works better with e.g. #[cfg(...)]. I personally prefer option 1.
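
As a rough illustration of why option 1 fits `macro_rules!`, the sketch below folds a comma-separated list of feature literals into one constant mask. Everything here is hypothetical: the macro names `feature_bit!` and `all_features_detected!`, the helper functions, and the bit layout (the position for `bmi2` in particular is made up).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical cache; real code would initialize it lazily from CPUID.
static FEATURE_CACHE: AtomicU64 = AtomicU64::new(0);

/// Single relaxed load of the cached bitmap (hypothetical helper).
pub fn cached_features() -> u64 {
    FEATURE_CACHE.load(Ordering::Relaxed)
}

/// Stand-in for the one-time initialization (hypothetical helper).
pub fn set_cached_features(bits: u64) {
    FEATURE_CACHE.store(bits, Ordering::Relaxed);
}

/// Maps a feature-name literal to its assumed bit at compile time.
macro_rules! feature_bit {
    ("pclmulqdq") => { 1u64 << 1 };
    ("bmi2") => { 1u64 << 9 };
    ("avx2") => { 1u64 << 15 };
}

/// Multi-argument form: true only if all listed features are present.
macro_rules! all_features_detected {
    ($($feature:tt),+ $(,)?) => {{
        // The mask is a compile-time constant built from all arguments.
        const MASK: u64 = 0 $(| feature_bit!($feature))+;
        (cached_features() & MASK) == MASK
    }};
}
```

Usage would look like `all_features_detected!("avx2", "bmi2")`. An unknown feature name fails to compile because no `feature_bit!` rule matches it.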

Alternatives

There is a workaround:

#[inline(always)]
pub fn is_enabled_avx2_and_bmi2() -> bool {
    #[cfg(any(target_arch = "x86_64", target_arch = "x86"))]
    #[cfg(feature = "std")]
    {
        use std::sync::atomic::{AtomicU8, Ordering};

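        // 0 = not available, 1 = available, 2 = not yet checked.
        static CACHE: AtomicU8 = AtomicU8::new(2);

        return match CACHE.load(Ordering::Relaxed) {
            0 => false,
            1 => true,
            _ => {
                // Slow path: run both checks once and cache the combined result.
                let detected = std::is_x86_feature_detected!("avx2")
                    && std::is_x86_feature_detected!("bmi2");
                CACHE.store(u8::from(detected), Ordering::Relaxed);
                detected
            }
        };
    }

    // Other architectures, or builds without the `std` feature.
    false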
}

Links and related work

  • https://github.com/rust-lang/stdarch/issues/348

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.

folkertdev avatar May 09 '25 10:05 folkertdev

Am I missing something? Your example alternative code never checks for bmi.

pitaj avatar May 09 '25 12:05 pitaj

ah right (I've been running a bunch of benchmarks with different configurations). Fixed now, thanks!

folkertdev avatar May 09 '25 12:05 folkertdev

Prior relevant discussion: https://internals.rust-lang.org/t/better-codegen-for-cpu-feature-detection/22083

pitaj avatar May 09 '25 14:05 pitaj

        and     eax, 2
        xor     ecx, ecx
        or      rax, rcx
        setne   al

even that code is inefficient, it could just be:

and eax, 2
shr eax, 1

or if it was immediately used for branching, just:

test eax, 2
jnz has_feature

programmerjake avatar May 09 '25 18:05 programmerjake

> This all makes sense, given that the cache is stored in an atomic, so the read value cannot be reused

I know in vtables we add some special LLVM things to tell it that re-reading will always give the same value, even if there's other stuff between.

Obviously we can't do that for the lazy-init part, but maybe in the normal case after that there'd be a way?


> returns a bitmap of enabled features

Does it have to be a bitmap? Could we return some kind of x86-specific library type with .avx2() and .bmi2() and such?

(Or even let that type exist on all platforms, just trivially returns false for everything.)
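
A minimal sketch of what such a type could look like, assuming a hypothetical `X86Features` snapshot over a hypothetical cache. None of these names exist in std, and the bit layout is made up for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical stand-in for the detection cache; 0 means nothing detected.
static CACHE: AtomicU64 = AtomicU64::new(0);

/// Hypothetical snapshot of the detected features.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct X86Features(u64);

impl X86Features {
    // Assumed bit layout, for illustration only.
    const PCLMULQDQ: u64 = 1 << 1;
    const BMI2: u64 = 1 << 9;
    const AVX2: u64 = 1 << 15;

    /// Takes one snapshot of the cache; every query below reuses it.
    pub fn detect() -> Self {
        X86Features(CACHE.load(Ordering::Relaxed))
    }

    /// Builds a snapshot from raw bits (useful for tests).
    pub const fn from_bits(bits: u64) -> Self {
        X86Features(bits)
    }

    pub const fn pclmulqdq(self) -> bool {
        (self.0 & Self::PCLMULQDQ) != 0
    }

    pub const fn bmi2(self) -> bool {
        (self.0 & Self::BMI2) != 0
    }

    pub const fn avx2(self) -> bool {
        (self.0 & Self::AVX2) != 0
    }
}
```

A caller would write `let f = X86Features::detect(); f.avx2() && f.bmi2()`, paying for one load. On other platforms `detect()` could return an all-zero snapshot so that every query is trivially false.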

scottmcm avatar May 09 '25 19:05 scottmcm

Sure, it doesn't have to be literally a bitmap, there is a lot of freedom in exactly how to implement it.

One tricky thing is that the features are stored as 3 atomics, so depending on what features you ask for, one load might have all the bits you need, or you might need all 3 loads. So we don't want to repeat work when 2 features are stored in the same atomic value, but also don't want to pessimistically load all three values.

https://github.com/rust-lang/stdarch/blob/f1c1839c0deb985a9f98cbd6b38a6d43f2df6157/crates/std_detect/src/detect/cache.rs#L76-L80

static CACHE: [Cache; 3] = [
    Cache::uninitialized(),
    Cache::uninitialized(),
    Cache::uninitialized(),
];

struct Cache(AtomicUsize);

Also looking at this now, that static might benefit from #[align] to make sure it gets its own cache line.
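
One way to avoid both repeated and unnecessary loads is to compute a mask per cache word at compile time and skip words whose mask is zero. The sketch below assumes a hypothetical layout where feature index `i` lives in word `i / 64` at bit `i % 64`, which is not necessarily how std_detect packs its bits. The 64-byte alignment is likewise an assumption about the target's cache line size.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Three cache words on their own (assumed 64-byte) cache line.
#[repr(align(64))]
struct Cache([AtomicU64; 3]);

static CACHE: Cache = Cache([AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)]);

/// Per-word masks for a set of feature indices; usable in `const` context.
pub const fn masks(features: &[u32]) -> [u64; 3] {
    let mut out = [0u64; 3];
    let mut i = 0;
    while i < features.len() {
        let f = features[i] as usize;
        out[f / 64] |= 1u64 << (f % 64);
        i += 1;
    }
    out
}

/// True if every requested bit is set, loading only the words that matter.
#[inline]
pub fn all_detected(masks: [u64; 3]) -> bool {
    masks.iter().zip(CACHE.0.iter()).all(|(&mask, word)| {
        // A zero mask means no requested feature lives in this word: no load.
        mask == 0 || (word.load(Ordering::Relaxed) & mask) == mask
    })
}

/// Stand-in for initialization: marks one feature index as detected.
pub fn mark_detected(feature: u32) {
    let f = feature as usize;
    CACHE.0[f / 64].fetch_or(1u64 << (f % 64), Ordering::Relaxed);
}
```

When the masks are constants and `all_detected` is inlined, the compiler has enough information to drop the branches for zero masks, so two features in the same word cost one load.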

folkertdev avatar May 09 '25 19:05 folkertdev

Note that this interacts with the accepted RFC for splitting these macros into core and std parts: https://github.com/Amanieu/rfcs/blob/core-detect/text/0000-core_detect.md. (@sayantn expressed some interest in possibly picking up that part.)

tgross35 avatar May 09 '25 21:05 tgross35

In https://github.com/rust-lang/rfcs/pull/3469#issuecomment-2895848758 I argue that the whole system likely needs to be re-designed anyways, so this would be a good opportunity to provide a better API as part of the re-design.

Amanieu avatar May 20 '25 21:05 Amanieu

What about

enum TargetFeature {
    Sse,
    Sse2,
    ...
}

pub fn are_features_detected<const N: usize>(features: [TargetFeature; N]) -> bool;
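
A minimal sketch of how that signature could be backed by a single load, assuming a hypothetical cache and a made-up discriminant-to-bit mapping. Only three variants are shown.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical stand-in for the detection cache.
static CACHE: AtomicU64 = AtomicU64::new(0);

/// Each discriminant is the (assumed) bit index of the feature in the cache.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TargetFeature {
    Pclmulqdq = 1,
    Bmi2 = 9,
    Avx2 = 15,
}

impl TargetFeature {
    /// The single-bit mask for this feature.
    pub const fn bit(self) -> u64 {
        1u64 << (self as u32)
    }
}

/// Folds the requested features into one mask, then checks it with one load.
pub fn are_features_detected<const N: usize>(features: [TargetFeature; N]) -> bool {
    let mut mask = 0u64;
    for feature in features {
        mask |= feature.bit();
    }
    (CACHE.load(Ordering::Relaxed) & mask) == mask
}

/// Stand-in for initialization: marks one feature as detected.
pub fn mark_detected(feature: TargetFeature) {
    CACHE.fetch_or(feature.bit(), Ordering::Relaxed);
}
```

A call such as `are_features_detected([TargetFeature::Avx2, TargetFeature::Bmi2])` passes the array by value, so the mask can be constant-folded when the arguments are literals.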

pitaj avatar May 20 '25 22:05 pitaj

I do think this is going to need to be redesigned in the future (e.g. for things like passing around a type-level proof that you have a given target feature), but in the short-term, I personally think it'd be reasonable to support passing multiple arguments to is_x86_feature_detected!.

joshtriplett avatar Jun 03 '25 16:06 joshtriplett

We discussed this in today's meeting. @Amanieu observed that we may want support for "or" rather than "and". After some discussion, we landed on supporting boolean && and || syntax in the feature detection macros.

is_x86_feature_detected!(("f1" && "f2") || "f3")

We'd be happy to accept a PR implementing this syntax on nightly, for each of the feature detection macros. (The new syntax should be feature-gated, not insta-stable.)
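
To show that the boolean syntax is expressible with `macro_rules!`, here is a rough sketch, not the accepted implementation. It takes one snapshot of a hypothetical cache, then rewrites each feature literal into a bit test while leaving `&&`, `||`, and parentheses in place, so Rust's own operator precedence applies. The macro names, helpers, and bit layout are all assumptions.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical cache; real code would initialize it lazily from CPUID.
static FEATURE_CACHE: AtomicU64 = AtomicU64::new(0);

pub fn cached_features() -> u64 {
    FEATURE_CACHE.load(Ordering::Relaxed)
}

pub fn set_cached_features(bits: u64) {
    FEATURE_CACHE.store(bits, Ordering::Relaxed);
}

/// Maps a feature-name literal to its assumed bit at compile time.
macro_rules! feature_bit {
    ("pclmulqdq") => { 1u64 << 1 };
    ("bmi2") => { 1u64 << 9 };
    ("avx2") => { 1u64 << 15 };
}

macro_rules! features_detected {
    // Done: emit the rewritten boolean expression.
    (@rewrite $bits:ident [$($out:tt)*]) => { $($out)* };
    // Operators are copied through unchanged.
    (@rewrite $bits:ident [$($out:tt)*] && $($rest:tt)*) => {
        features_detected!(@rewrite $bits [$($out)* &&] $($rest)*)
    };
    (@rewrite $bits:ident [$($out:tt)*] || $($rest:tt)*) => {
        features_detected!(@rewrite $bits [$($out)* ||] $($rest)*)
    };
    // A parenthesized group is rewritten recursively.
    (@rewrite $bits:ident [$($out:tt)*] ($($inner:tt)+) $($rest:tt)*) => {
        features_detected!(@rewrite $bits
            [$($out)* (features_detected!(@rewrite $bits [] $($inner)+))]
            $($rest)*)
    };
    // Anything else is treated as a feature literal and becomes a bit test.
    (@rewrite $bits:ident [$($out:tt)*] $feature:tt $($rest:tt)*) => {
        features_detected!(@rewrite $bits
            [$($out)* (($bits & feature_bit!($feature)) != 0)]
            $($rest)*)
    };
    // Entry point: one load, then a pure expression over the snapshot.
    ($($expr:tt)+) => {{
        let bits = cached_features();
        let result: bool = features_detected!(@rewrite bits [] $($expr)+);
        result
    }};
}
```

For example, `features_detected!(("avx2" && "bmi2") || "pclmulqdq")` performs a single load. A real implementation could go further and merge pure "and" groups into one mask comparison.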

joshtriplett avatar Jun 03 '25 17:06 joshtriplett

What’s an example of when one would want “or”? Every target feature check I’ve ever read or written has been “and” because it’s always about “can I soundly call these core::arch functions” and I don’t think rustc/LLVM even has a notion of “this function needs either this target feature or that target feature”. There are implications of the form “feat1 implies feat3, feat2 implies feat3” where feat1 and feat2 are independent or even mutually exclusive, but these implications should be resolved earlier so the user’s feature detection can just check the relevant feat3 directly.

hanna-kruppe avatar Jun 04 '25 13:06 hanna-kruppe

Based on the meeting notes apparently RISC-V would get some use out of the "or" operation.

folkertdev avatar Jun 04 '25 14:06 folkertdev

This ACP has been accepted.

Please open a tracking issue and open a PR to rust-lang/rust to add it as an unstable feature. You can close this ACP once the tracking issue has been created.

Amanieu avatar Jun 10 '25 17:06 Amanieu