runtime icon indicating copy to clipboard operation
runtime copied to clipboard

Vectorize IndexOfAnyExcept<T>(T value)

Open stephentoub opened this issue 3 years ago • 2 comments

Contributes to https://github.com/dotnet/runtime/issues/67942

[Params(1, 4, 16, 64, 256, 1024)]
public int Length { get; set; }

private byte[] _zeros;

[GlobalSetup]
public void Setup() => _zeros = new byte[Length];

[Benchmark]
public bool AllZeroTrue() => _zeros.AsSpan().IndexOfAnyExcept((byte)0) < 0;
Method Toolchain Length Mean Ratio Code Size
AllZeroTrue \main\corerun.exe 1 0.6482 ns 1.00 72 B
AllZeroTrue \pr\corerun.exe 1 1.7013 ns 2.62 242 B
AllZeroTrue \main\corerun.exe 4 2.6035 ns 1.00 72 B
AllZeroTrue \pr\corerun.exe 4 3.5931 ns 1.38 242 B
AllZeroTrue \main\corerun.exe 16 4.8521 ns 1.00 72 B
AllZeroTrue \pr\corerun.exe 16 2.3925 ns 0.49 242 B
AllZeroTrue \main\corerun.exe 64 18.6816 ns 1.00 72 B
AllZeroTrue \pr\corerun.exe 64 3.5615 ns 0.19 242 B
AllZeroTrue \main\corerun.exe 256 81.8503 ns 1.00 72 B
AllZeroTrue \pr\corerun.exe 256 10.3610 ns 0.13 242 B
AllZeroTrue \main\corerun.exe 1024 310.6106 ns 1.00 72 B
AllZeroTrue \pr\corerun.exe 1024 37.1116 ns 0.12 242 B

The one thing that makes me a tad hesitant is the case of the except value being found at the very beginning, in which case this does incur a (very small) penalty even for really long inputs, e.g.

[Params(1024)]
public int Length { get; set; }

private byte[] _allBitsSet;

[GlobalSetup]
public void Setup() => _allBitsSet = Enumerable.Repeat((byte)0xFF, Length).ToArray();

[Benchmark]
public bool AllBitsSetLookingForNon0() => _allBitsSet.AsSpan().IndexOfAnyExcept((byte)0) < 0;
Method Toolchain Length Mean Ratio RatioSD Code Size
AllBitsSetLookingForNon0 \main\corerun.exe 1024 0.4979 ns 1.00 0.00 72 B
AllBitsSetLookingForNon0 \pr\corerun.exe 1024 1.7421 ns 3.51 0.11 242 B

I'm not sure what if anything to do about it.

stephentoub avatar Aug 05 '22 18:08 stephentoub

Tagging subscribers to this area: @dotnet/area-system-memory See info in area-owners.md if you want to be subscribed.

Issue Details

Contributes to https://github.com/dotnet/runtime/issues/67942

[Params(1, 4, 16, 64, 256, 1024)]
public int Length { get; set; }

private byte[] _zeros;

[GlobalSetup]
public void Setup() => _zeros = new byte[Length];

[Benchmark]
public bool AllZeroTrue() => _zeros.AsSpan().IndexOfAnyExcept((byte)0) < 0;
Method Toolchain Length Mean Ratio Code Size
AllZeroTrue \main\corerun.exe 1 0.6482 ns 1.00 72 B
AllZeroTrue \pr\corerun.exe 1 1.7013 ns 2.62 242 B
AllZeroTrue \main\corerun.exe 4 2.6035 ns 1.00 72 B
AllZeroTrue \pr\corerun.exe 4 3.5931 ns 1.38 242 B
AllZeroTrue \main\corerun.exe 16 4.8521 ns 1.00 72 B
AllZeroTrue \pr\corerun.exe 16 2.3925 ns 0.49 242 B
AllZeroTrue \main\corerun.exe 64 18.6816 ns 1.00 72 B
AllZeroTrue \pr\corerun.exe 64 3.5615 ns 0.19 242 B
AllZeroTrue \main\corerun.exe 256 81.8503 ns 1.00 72 B
AllZeroTrue \pr\corerun.exe 256 10.3610 ns 0.13 242 B
AllZeroTrue \main\corerun.exe 1024 310.6106 ns 1.00 72 B
AllZeroTrue \pr\corerun.exe 1024 37.1116 ns 0.12 242 B

The one thing that makes me a tad hesitant is the case of the except value being found at the very beginning, in which case this does incur a (very small) penalty even for really long inputs, e.g.

[Params(1024)]
public int Length { get; set; }

private byte[] _allBitsSet;

[GlobalSetup]
public void Setup() => _allBitsSet = Enumerable.Repeat((byte)0xFF, Length).ToArray();

[Benchmark]
public bool AllBitsSetLookingForNon0() => _allBitsSet.AsSpan().IndexOfAnyExcept((byte)0) < 0;
Method Toolchain Length Mean Ratio RatioSD Code Size
AllBitsSetLookingForNon0 \main\corerun.exe 1024 0.4979 ns 1.00 0.00 72 B
AllBitsSetLookingForNon0 \pr\corerun.exe 1024 1.7421 ns 3.51 0.11 242 B

I'm not sure what if anything to do about it.

Author: stephentoub
Assignees: -
Labels:

area-System.Memory

Milestone: 7.0.0

msftbot[bot] avatar Aug 05 '22 18:08 msftbot[bot]

Failure is https://github.com/dotnet/runtime/issues/73247

stephentoub avatar Aug 06 '22 15:08 stephentoub

@adamsitnik, can you please review this? Thanks.

stephentoub avatar Aug 10 '22 09:08 stephentoub

Thanks for reviewing, Adam.

stephentoub avatar Aug 10 '22 11:08 stephentoub

I'm not sure what if anything to do about it.

I think this is an acceptable tradeoff. The measured difference is just over 1ns which would be about 2-4 clock cycles on most modern computers, this is likely representative of the jump to IndexOfAnyExceptValueType and branch cost for checking the length.

In practice, the actual spilling of args, call, and general memory latency is more than this so most users likely won't see a difference. It's likewise a general avenue we might want to look at improved JIT support around, for being able to ensure that cost for such checks can be minimized.

One think that might help would be making it so that you have a pattern like the following and marking the containing method as AggressiveInlining

if (length < Vector128<T>.Count)
{
    ScalarImpl();
}
else
{
    VectorImpl();
}

This would theoretically allow the JIT to hoist, constant fold, or otherwise mitigate the branch for various input kinds.

tannergooding avatar Aug 10 '22 15:08 tannergooding

this is likely representative of the jump to IndexOfAnyExceptValueType and branch cost for checking the length.

To be clear, I wasn't referring to the case where there are fewer than Count elements, rather the case where there are more than Count elements and the very first element doesn't match the specified value. I expect that to be a fairly common occurrence with some uses of this method.

stephentoub avatar Aug 11 '22 01:08 stephentoub

Hm.. looks like this regressed System.Collections.Concurrent.IsEmpty<String>.Dictionary(Size: 512) but none regressions were filed image

https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_ubuntu%2018.04%2fSystem.Collections.Concurrent.IsEmpty(String).Dictionary(Size%3a%20512).html

(I blamed this PR because ConcurrentDictionary uses IndexOfAnyExcept for AreAllBucketsEmpty)

EgorBo avatar Sep 13 '22 12:09 EgorBo

looks like this regressed System.Collections.Concurrent.IsEmpty<String>.Dictionary(Size: 512) but none regressions were filed

By a nanosecond or two, right? That's largely expected. I'd expect IsEmpty to get a tad slower when the dictionary isn't empty and faster when it is empty; basically an extra branch or two in the check.

stephentoub avatar Sep 13 '22 20:09 stephentoub

I totally agree and didn't file an issue for it, it's just needed for perf study report

EgorBo avatar Sep 13 '22 20:09 EgorBo