Vulkan-ValidationLayers

Parallel vkCreateComputePipelines calls are extremely slow with shader instrumentation

Open · Red-Panda64 opened this issue 4 months ago · 1 comment

Environment:

  • GPU and driver version: NVIDIA RTX A4000 w/ 535.230
  • OS: RHEL8
  • build version: 270dbea039c2b485e33330be58a3e0ebcaef485f
  • Options enabled: printf_enable or gpuav_enable

Describe the Issue

The GPU-AV validations cause immense performance overhead (a factor of more than 20) on vkCreateComputePipelines when it is invoked in parallel. In fact, it is even sufficient for the vkCreateComputePipelines calls to run sequentially in some non-deterministic order. The problem also occurs when GPU-AV is disabled but debug printf is enabled, since debug printf is also implemented via gpuav_shader_instrumentor.cpp.

The apparent cause of the slowdown is that compute pipelines created this way do not benefit from the driver's shader cache (at ~/.cache/nvidia/GLCache by default). The cache grows continually across multiple runs of the same program, even though the program always creates the same pipelines (just in a different order each time). This points to some state held by the shader instrumentor that results in different SPIR-V for different invocation orders.

Expected behavior

Enabling debug printf or GPU-assisted validation should not circumvent the driver's shader cache mechanism on repeated runs, and should not unreasonably increase pipeline creation time.

Additional context

To reproduce the issue, you can apply this patch to the gpuav stress test:

diff --git a/tests/stress/gpu_av_stress.cpp b/tests/stress/gpu_av_stress.cpp
index fed19c0df..4f77265ba 100644
--- a/tests/stress/gpu_av_stress.cpp
+++ b/tests/stress/gpu_av_stress.cpp
@@ -21,6 +21,12 @@
 #include "../framework/descriptor_helper.h"
 #include "gpu_av_helper.h"
 
+#include <algorithm>
+#include <chrono>
+#include <cstdlib>
+#include <random>
+#include <sstream>
+
 // If on Mesa, also suggest using MESA_SHADER_CACHE_DISABLE=1
 class StressGpuAV : public VkLayerTest {
   public:
@@ -109,7 +114,9 @@ TEST_F(StressGpuAV, DescriptorIndexing) {
                                             VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL, 2);
     descriptor_set.UpdateDescriptorSets();
 
-    const char *cs_source = R"glsl(
+    std::vector<std::string> cs_sources;
+
+    const char *cs_source_prefix = R"glsl(
         #version 450
         #extension GL_EXT_nonuniform_qualifier : enable
 
@@ -124,7 +131,8 @@ TEST_F(StressGpuAV, DescriptorIndexing) {
         }
 
         vec4 bar(uint index) {
-           vec4 result = vec4(1.0);
+           vec4 result = vec4()glsl";
+    const char *cs_source_suffix = R"glsl();
            result -= texture(tex[index], vec2(0.1, 5.0));
            result -= texture(tex[index], vec2(0.2, 5.0));
            result -= texture(tex[index], vec2(0.3, 5.0));
@@ -175,20 +183,29 @@ TEST_F(StressGpuAV, DescriptorIndexing) {
            result += bar(data.index + 2);
         }
     )glsl";
-
-    CreateComputePipelineHelper pipe(*this);
-    pipe.cs_ = VkShaderObj(this, cs_source, VK_SHADER_STAGE_COMPUTE_BIT, SPV_ENV_VULKAN_1_2);
-    pipe.cp_ci_.layout = pipeline_layout;
-    pipe.CreateComputePipeline();
-
-    m_command_buffer.Begin();
-    vk::CmdBindPipeline(m_command_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipe);
-    vk::CmdBindDescriptorSets(m_command_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline_layout, 0, 1, &descriptor_set.set_, 0,
-                              nullptr);
-    vk::CmdDispatch(m_command_buffer, 1, 1, 1);
-    m_command_buffer.End();
-
-    m_default_queue->SubmitAndWait(m_command_buffer);
+    const char *seed_str = std::getenv("STRESS_TEST_SEED");
+    unsigned long long seed;
+    if(seed_str) {
+        seed = std::strtoull(seed_str, NULL, 0);
+    } else {
+        seed = std::chrono::system_clock::now().time_since_epoch().count();
+    }
+    constexpr size_t N = 50;
+    for(size_t i = 0; i < N; i++) {
+        std::stringstream cs_source;
+        cs_source << cs_source_prefix << i << cs_source_suffix;
+        cs_sources.emplace_back(cs_source.str());
+    }
+    auto engine = std::default_random_engine{};
+    engine.seed(seed);
+    std::shuffle(cs_sources.begin(), cs_sources.end(), engine);
+
+    for(auto &cs_source : cs_sources){
+        CreateComputePipelineHelper pipe(*this);
+        pipe.cs_ = VkShaderObj(this, cs_source.c_str(), VK_SHADER_STAGE_COMPUTE_BIT, SPV_ENV_VULKAN_1_2);
+        pipe.cp_ci_.layout = pipeline_layout;
+        pipe.CreateComputePipeline();
+    }
 }
 
 TEST_F(StressGpuAV, DescriptorIndexing2) {

This test stitches together 50 slightly different shaders and shuffles them. Repeatedly running it as ctest -R StressGpuAV.DescriptorIndexing$ yields runtimes of 22.18s, 21.74s, 22.29s, etc., and an ever-growing shader cache. Running it with a fixed seed as STRESS_TEST_SEED=0 ctest -R StressGpuAV.DescriptorIndexing$ instead yields 21.27s, 0.79s, 0.77s, etc., and the shader cache stops growing after the first run.

Red-Panda64 · Oct 14 '25 12:10

I will take a look, thanks for mocking this up in the Stress test for us

I personally develop on AMD/Intel Mesa Linux machines and know the shader cache works with the Mesa drivers. I have not explored the Linux NVIDIA drivers, and this seems suspicious. When we run GPU-AV shader instrumentation, it is supposed to be very deterministic, and the output SPIR-V generated should be the same across multiple runs.

spencer-lunarg · Oct 14 '25 12:10