Race condition in validation of VK_KHR_deferred_host_operations functions
There appears to be a race condition that can prevent pipeline registration when vkDeferredOperationJoinKHR and vkGetDeferredOperationResultKHR are called from multiple threads following a vkCreateRayTracingPipelinesKHR call. Consider the code example from the extension's documentation page (https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_KHR_deferred_host_operations.html):
VkResult result = vkDeferredOperationJoinKHR(device, hOp);
while( result == VK_THREAD_IDLE_KHR )
{
    std::this_thread::yield();
    result = vkDeferredOperationJoinKHR(device, hOp);
}
switch( result )
{
    case VK_SUCCESS:
    {
        // deferred operation has finished. Query its result
        result = vkGetDeferredOperationResultKHR(device, hOp);
    }
    break;
    case VK_THREAD_DONE_KHR:
    {
        // deferred operation is being wrapped up by another thread
        // wait for that thread to finish
        do
        {
            std::this_thread::yield();
            result = vkGetDeferredOperationResultKHR(device, hOp);
        } while( result == VK_NOT_READY );
    }
    break;
    default:
        assert(false); // other conditions are illegal.
        break;
}
When multiple threads execute this code on the same operation in parallel, the validator can fail to register the pipeline inside the call to vkGetDeferredOperationResultKHR. More specifically, consider the scenario where one thread gets VK_THREAD_DONE_KHR from vkDeferredOperationJoinKHR and then enters the spinning vkGetDeferredOperationResultKHR loop. Meanwhile, another thread is processing the last chunk of work in its vkDeferredOperationJoinKHR call. As soon as the work internally completes, the implementation can return VK_SUCCESS to the first thread, which is still spinning on vkGetDeferredOperationResultKHR. It is therefore possible (and in my experience very likely) that the thread that received VK_THREAD_DONE_KHR from the Join call runs the if (result == VK_SUCCESS) registration epilogue in DispatchGetDeferredOperationResultKHR before the thread completing the deferred operation (and thus ultimately receiving VK_SUCCESS from its vkDeferredOperationJoinKHR call) runs the registration epilogue in DispatchDeferredOperationJoinKHR. As a result, deferred_operation_pipelines does not contain the pipeline yet, and the cleanup_fn callbacks are not triggered. Ultimately, the next API call that uses the pipeline fails because the validator does not see it as created.
This can lead to the following validation error:
Validation Error: [ VUID-vkGetRayTracingShaderGroupHandlesKHR-pipeline-parameter ] Object 0: handle = 0x2cc2f6ca5e0, type = VK_OBJECT_TYPE_INSTANCE; | MessageID = 0x1f2e8acf | Invalid VkPipeline Object 0x7ce53000000349de. The Vulkan spec states: pipeline must be a valid VkPipeline handle (https://vulkan.lunarg.com/doc/view/1.3.231.1/windows/1.3-extensions/vkspec.html#VUID-vkGetRayTracingShaderGroupHandlesKHR-pipeline-parameter)
We can work around the issue by ensuring that only the completing thread (the one receiving VK_SUCCESS from vkDeferredOperationJoinKHR) calls vkGetDeferredOperationResultKHR, but it appears that this shouldn't be required.
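For reference, here is a minimal sketch of that workaround. To keep it compilable without the Vulkan headers, it uses a stand-in enum `Result` and takes the two entry points as callables; both are assumptions made for illustration. In real code these would be VkResult, vkDeferredOperationJoinKHR(device, hOp) and vkGetDeferredOperationResultKHR(device, hOp). The helper name `JoinAndMaybeQuery` is hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <thread>

// Stand-in for the VkResult values involved (hypothetical; real code uses
// VK_SUCCESS, VK_NOT_READY, VK_THREAD_IDLE_KHR and VK_THREAD_DONE_KHR).
enum class Result { Success, NotReady, ThreadIdle, ThreadDone };

// Joins the deferred operation. Only the thread whose join call returns
// Success queries the operation result; every other thread returns
// std::nullopt without ever calling get_result.
template <typename JoinFn, typename GetResultFn>
std::optional<Result> JoinAndMaybeQuery(JoinFn join, GetResultFn get_result) {
    Result r = join();
    while (r == Result::ThreadIdle) {
        std::this_thread::yield();
        r = join();
    }
    if (r == Result::Success) {
        // This thread completed the operation, so it is the only one that queries.
        return get_result();
    }
    // Another thread is wrapping up the operation; other results are illegal here.
    assert(r == Result::ThreadDone);
    return std::nullopt;
}
```

Threads that get std::nullopt back would then need some application-side mechanism (for example a flag or future set by the completing thread) to learn the final result, which is part of why this feels like something the application shouldn't have to do.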
Environment:
- OS: Windows 10 22H2
- GPU: GeForce 3080
- SDK or header version if building from repo: SDK 1.3.231
FYI @jeremyg-lunarg
Just to clarify (for my own understanding), the problematic scenario you're referring to is something like:
Thread 1:
t0: vkDeferredOperationJoinKHR == VK_SUCCESS
t3: vkGetDeferredOperationResultKHR (triggers error because vkGetDeferredOperationResultKHR at t2 has removed the pipeline)
Thread 2:
t1: vkDeferredOperationJoinKHR == VK_THREAD_DONE_KHR
t2: vkGetDeferredOperationResultKHR == VK_SUCCESS
Where tN represents operation order. I wonder if TlsGuard could possibly be used here for deferred_operation_pipelines storage?
Apologies for not making this entirely clear in the first place. Here's an example of a timeline that would trigger the issue:
Thread 1:
t0: Call vkDeferredOperationJoinKHR
t2: vkDeferredOperationJoinKHR == VK_THREAD_DONE_KHR
t3: Start spinning loop on vkGetDeferredOperationResultKHR as long as we're getting VK_NOT_READY.
t5: layer_data->device_dispatch_table.GetDeferredOperationResultKHR == VK_SUCCESS
t6: DispatchGetDeferredOperationResultKHR epilogue: Problem, pipeline hasn't been inserted into deferred_operation_pipelines yet -> deferred_operation_post_check is popped but not executed.
Thread 2:
t1: Call vkDeferredOperationJoinKHR
t4: Deferred operation work internally completes, but vkDeferredOperationJoinKHR hasn't returned just yet
t7: layer_data->device_dispatch_table.DeferredOperationJoinKHR == VK_SUCCESS
t8: DispatchDeferredOperationJoinKHR epilogue: Execute deferred_operation_post_completion callbacks, adding pipeline to deferred_operation_pipelines, but too late.
t9: DispatchGetDeferredOperationResultKHR epilogue: deferred_operation_post_check is empty now, so CreateObject is never called on the pipeline.
Any thread: t10: Any API call using the RT pipeline will report a validation error because CreateObject wasn't called.
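To make the ordering problem concrete, the epilogue steps t6, t8 and t9 can be replayed sequentially against a toy model of the bookkeeping. This is only a sketch based on the description above, not the validation layer's actual code: the struct, its members, and the integer handles are all assumptions, with `post_check_pending` standing in for deferred_operation_post_check and `registered` standing in for "CreateObject was called".

```cpp
#include <map>
#include <set>
#include <utility>
#include <vector>

// Toy model of the bookkeeping described in this thread (assumed structure).
struct LayerModel {
    using Op = int;
    using Pipeline = int;

    std::map<Op, std::vector<Pipeline>> pending_pipelines;  // recorded at creation
    std::map<Op, std::vector<Pipeline>> deferred_operation_pipelines;
    std::set<Op> post_check_pending;  // stands in for deferred_operation_post_check
    std::set<Pipeline> registered;    // stands in for "CreateObject was called"

    void CreatePipelinesDeferred(Op op, std::vector<Pipeline> pipelines) {
        pending_pipelines[op] = std::move(pipelines);
        post_check_pending.insert(op);
    }

    // Epilogue run after the join call returned VK_SUCCESS:
    // publishes the pipelines for the operation.
    void JoinEpilogue(Op op) {
        auto it = pending_pipelines.find(op);
        if (it == pending_pipelines.end()) return;
        deferred_operation_pipelines[op] = std::move(it->second);
        pending_pipelines.erase(it);
    }

    // Epilogue run after the get-result call returned VK_SUCCESS:
    // pops the post-check, then registers pipelines only if they were published.
    void GetResultEpilogue(Op op) {
        if (post_check_pending.erase(op) == 0) return;  // already popped
        auto it = deferred_operation_pipelines.find(op);
        if (it == deferred_operation_pipelines.end()) return;  // nothing published yet
        registered.insert(it->second.begin(), it->second.end());
        deferred_operation_pipelines.erase(it);
    }
};

// Replays the epilogues in the racy order (t6, t8, t9) or in the expected order.
bool PipelineRegistered(bool racy) {
    LayerModel layer;
    layer.CreatePipelinesDeferred(1, {42});
    if (racy) {
        layer.GetResultEpilogue(1);  // t6: Thread 1, post-check popped but not run
        layer.JoinEpilogue(1);       // t8: Thread 2, pipeline published too late
        layer.GetResultEpilogue(1);  // t9: Thread 2, post-check already gone
    } else {
        layer.JoinEpilogue(1);
        layer.GetResultEpilogue(1);
    }
    return layer.registered.count(42) != 0;
}
```

In this model PipelineRegistered(false) returns true and PipelineRegistered(true) returns false, which matches the t10 outcome above.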
I hope this clarifies the issue a bit better.
As a side note, the specification for vkDeferredOperationJoinKHR states that the application should call vkGetDeferredOperationResultKHR, but does not state that it must. I can't think of a good reason for an application not to call vkGetDeferredOperationResultKHR, but it sounds like it should be supported. A possibly legal (but potentially dangerous) bit of code would, for example, just assume that the RT pipeline was successfully created upon receiving VK_SUCCESS from any vkDeferredOperationJoinKHR call (assuming that the pipeline creation succeeded). The current validation code, however, appears to rely on the vkGetDeferredOperationResultKHR call to internally track pipeline creation (unless I missed another hook somewhere else?), which would presumably break in the same way.
Ran into the same issue and it seems to have caused a segfault on this line: https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/62c65a6179ee0b1a065c33906fba8f44bcebcc0c/layers/core_checks/cc_pipeline_ray_tracing.cpp#L46