nvbench Fixes cudaErrorInvalidValue when running on nvbench-created cuda stream

This PR fixes a minor issue that may occur when nvbench is run on multiple GPUs without a user-provided cuda stream.

The issue

The error that I observed in this case looked like:

Fail: Unexpected error: nvbench/detail/l2flush.cuh:55: Cuda API call returned error: cudaErrorInvalidValue: invalid argument

When run with memcheck I would see:

Program hit cudaErrorInvalidValue (error 1) due to "invalid argument" on CUDA API call to cudaMemsetAsync.

The Problem

It seems that nvbench is creating all the nvbench-owned streams on device 0.

Suggested Fix

This fix ensures that each stream is created on the device on which it is later used.

Dec 07 '22 12:12 elstehle

This LGTM, thanks for catching it! Some of the tests don't build after the changes. If you have Docker set up, you can run ci/local/build.bash from the nvbench root to build and test.

Once tests are passing this is good to go.

Thanks for reviewing the PR. nvbench::cuda_stream used to be default constructible and is part of the public API. In this PR, I had required passing a std::optional<nvbench::device_info> to cuda_stream's ctor, which was a breaking change. To avoid that, I've now added the default ctor back to cuda_stream.
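
A minimal sketch of how keeping both constructors preserves source compatibility. The types device_info_sketch and cuda_stream_sketch are hypothetical stand-ins, not nvbench's actual classes, and no CUDA calls are made; the class only records which device a stream would be created on.

```cpp
#include <optional>

// Hypothetical stand-in for nvbench::device_info; only the ID matters here.
struct device_info_sketch
{
  int id;
};

// Hypothetical stand-in for nvbench::cuda_stream. It makes no CUDA calls and
// only remembers the device that the stream would be created on.
class cuda_stream_sketch
{
public:
  // Constructor taking an optional device: an engaged optional names the
  // device to create the stream on, an empty one means "current device".
  explicit cuda_stream_sketch(std::optional<device_info_sketch> device)
  {
    if (device.has_value())
    {
      m_device_id = device->id;
    }
  }

  // Default constructor kept so that existing `cuda_stream_sketch s;` code
  // still compiles; it delegates with an empty optional.
  cuda_stream_sketch()
      : cuda_stream_sketch(std::nullopt)
  {}

  // Empty if no device was requested explicitly.
  std::optional<int> device_id() const { return m_device_id; }

private:
  std::optional<int> m_device_id{};
};
```

With this shape, callers that default-construct a stream are unaffected, while new callers can pass the device explicitly.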

Jan 18 '23 15:01 elstehle

@elstehle I'm still seeing a test regression when running ci/local/build.bash on this branch:

 4/39 Test #32: nvbench.test.state_generator ..................***Failed    2.39 sec
/cccl/nvbench/nvbench/detail/device_scope.cuh:37: Cuda API call returned error: cudaErrorInvalidDevice: invalid device ordinal
Command: 'cudaSetDevice(dev_id)'

Jan 30 '23 18:01 alliepiper

Thanks! Sorry, I had missed that regression, as it only occurs on systems with fewer than three devices.

The issue with the test in testing/state_generator.cu was that it generates states for devices [0, 1, 2], regardless of whether those devices exist:

const auto device_0 = nvbench::device_info{0, {}};
const auto device_1 = nvbench::device_info{1, {}};
const auto device_2 = nvbench::device_info{2, {}};

dummy_bench bench;
bench.set_devices({device_0, device_1, device_2});
...
const std::vector<nvbench::state> states = nvbench::detail::state_generator::create(bench);

When the states are created, we create the stream for each state on that state's given device. If a given device doesn't exist, we run into a cuda error.

For comparison, if we currently ran a benchmark with invalid device IDs, the runner would fail with the same error:

../nvbench/device_info.cuh:71: Cuda API call returned error: cudaErrorInvalidDevice: invalid device ordinal

I resolved this regression by adjusting the test in testing/state_generator.cu to only run on devices actually available in the system. But I would like to confirm that we're generally ok with that behaviour.
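
As a rough illustration of the idea (not the actual change to the test), restricting the requested device IDs to those present could look like the sketch below. The helper available_device_ids is hypothetical; device_count is assumed to come from something like cudaGetDeviceCount and is passed in as a plain integer so that the sketch needs no CUDA runtime.

```cpp
#include <vector>

// Hypothetical helper: keeps only the requested device IDs that fall in
// [0, device_count), preserving their original order.
std::vector<int> available_device_ids(const std::vector<int> &requested,
                                      int device_count)
{
  std::vector<int> result;
  for (const int id : requested)
  {
    if (id >= 0 && id < device_count)
    {
      result.push_back(id);
    }
  }
  return result;
}
```

On a system with two devices, requesting IDs 0, 1, and 2 would leave only 0 and 1, so no state is generated for a device that does not exist.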

Jan 31 '23 17:01 elstehle