
possible Android memory corruption in validation or SPIRV used by validation

Open lunarpapillo opened this issue 1 year ago • 8 comments

Environment:

  • OS: Android
  • GPU and driver version: N/A, crash appears on all tested Android devices
  • SDK or header version if building from repo: Android NDK 26.3
  • Options enabled (synchronization, best practices, etc.):

Describe the Issue

When building and testing a Debug build using Android NDK 26.3, tests crash on all devices in the same place in VkArmBestPracticesLayerTest.ComputeShaderBadSpatialLocalityTest, inside an allocator within SPIRV-Tools:

#00 libVkLayer_khronos_validation.so (void std::__ndk1::allocator<unsigned int>::construct[abi:v170000]<unsigned int, unsigned int const&>(unsigned int*, unsigned int const&)+28)
...
#04 libVkLayer_khronos_validation.so (std::__ndk1::__wrap_iter<unsigned int*> std::__ndk1::vector<unsigned int, std::__ndk1::allocator<unsigned int> >::insert<std::__ndk1::__wrap_iter<unsigned int const*>, 0>(std::__ndk1::__wrap_iter<unsigned int const*>, std::__ndk1::__wrap_iter<unsigned int const*>, std::__ndk1::__wrap_iter<unsigned int const*>)+344) 
#05 libVkLayer_khronos_validation.so (spvtools::val::ValidationState_t::RegisterUniqueTypeDeclaration(spvtools::val::Instruction const*)+416)
#06 libVkLayer_khronos_validation.so (spvtools::val::(anonymous namespace)::ValidateUniqueness(spvtools::val::ValidationState_t&, spvtools::val::Instruction const*)+172)
#07 libVkLayer_khronos_validation.so (spvtools::val::TypePass(spvtools::val::ValidationState_t&, spvtools::val::Instruction const*)+88) 
#08 libVkLayer_khronos_validation.so (spvtools::val::(anonymous namespace)::ValidateBinaryUsingContextAndValidationState(spv_context_t const&, unsigned int const*, unsigned long, spv_diagnostic_t**, spvtools::val::ValidationState_t*)+3824) 
#09 libVkLayer_khronos_validation.so (spvValidateWithOptions+164)
#10 libVkLayer_khronos_validation.so (CoreChecks::RunSpirvValidation(spv_const_binary_t&, Location const&, ValidationCache*) const+296)
#11 libVkLayer_khronos_validation.so (CoreChecks::ValidateShaderModuleCreateInfo(VkShaderModuleCreateInfo const&, Location const&) const+692)
#12 libVkLayer_khronos_validation.so (CoreChecks::PreCallValidateCreateShaderModule(VkDevice_T*, VkShaderModuleCreateInfo const*, VkAllocationCallbacks const*, VkShaderModule_T**, ErrorObject const&) const+104) 
#13 libVkLayer_khronos_validation.so (vulkan_layer_chassis::CreateShaderModule(VkDevice_T*, VkShaderModuleCreateInfo const*, VkAllocationCallbacks const*, VkShaderModule_T**)+248)
#14  /system/lib64/libvulkan.so (vulkan::api::(anonymous namespace)::CreateShaderModule(VkDevice_T*, VkShaderModuleCreateInfo const*, VkAllocationCallbacks const*, VkShaderModule_T**)+160)
#15 libVulkanLayerValidationTests.so (vkt::ShaderModule::init(vkt::Device const&, VkShaderModuleCreateInfo const&)+168)
#16 libVulkanLayerValidationTests.so (VkShaderObj::InitFromGLSL(void const*)+224)
#17 libVulkanLayerValidationTests.so (VkShaderObj::VkShaderObj(VkRenderFramework*, char const*, VkShaderStageFlagBits, spv_target_env, SpvSourceType, VkSpecializationInfo const*, char const*, void const*)+268) 
#18 libVulkanLayerValidationTests.so (VkArmBestPracticesLayerTest_ComputeShaderBadSpatialLocalityTest_Test::TestBody()+296)
...
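
Frame #04 in the trace is the iterator-range overload of std::vector<unsigned int>::insert. For reference, here is a minimal standalone sketch of that operation. AppendWords is a hypothetical helper written for illustration only; it is not the SPIRV-Tools code in RegisterUniqueTypeDeclaration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not SPIRV-Tools code: appends the words of `src` to a
// copy of `dst` using the iterator-range overload of std::vector::insert,
// which is the overload named in frame #04.
std::vector<uint32_t> AppendWords(std::vector<uint32_t> dst,
                                  const std::vector<uint32_t>& src) {
    // (position, first, last): the elements are copy-constructed into `dst`,
    // which is where allocator::construct (frame #00) gets called.
    dst.insert(dst.end(), src.begin(), src.end());
    return dst;
}
```

Building something like this with the same NDK and Debug flags may help separate a problem in the overload itself from a problem in how the validator reaches it.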

The full ndk-stack output is available: 008-ndk-stack-info.txt

The crash appears when using a Debug build with Android NDK 26.3. It does not appear when using a Release build with NDK 26.3, nor (using either a Release or a Debug build) with either NDK 25.2 or NDK 27.0.

Given that the code appears to run correctly in a Release build, that the crash is device-independent, and that the crash occurs during memory allocation, it's fairly likely that the compiler isn't the issue and that something in validation or SPIRV is corrupting memory, which happens to crash validation when memory is laid out "just right". If Address Sanitizer is supported on Android, it might help uncover such corruption.

It's possible, though IMHO unlikely, that this is an unknown compiler bug that appeared in NDK 26 and disappeared in NDK 27, as symptoms like this are not listed as known issues: https://github.com/android/ndk/releases

To reproduce the problem, run a manual-Vulkan-ValidationLayers build with: http://tcubuser.lunarg.localdomain:8080/view/Manual/job/manual-Vulkan-ValidationLayers/build

  • BUILD_MODE: Debug
  • ANDROID_ARGS: --android-ndk 26.3
  • NODE: tcubuand1

lunarpapillo • Aug 22 '24 21:08

For reference, original chat is: https://chat.google.com/room/AAAAOXVAYGg/FL0Vh98x-gM/FL0Vh98x-gM?cls=10

lunarpapillo • Aug 22 '24 21:08

> tests crash on all devices in the same place in VkArmBestPracticesLayerTest.ComputeShaderBadSpatialLocalityTest,

This is 99% because VkArm is alphabetically first, so those tests run first; it would crash in any test.

spencer-lunarg • Aug 23 '24 01:08

I was working on a minimal repro case and got it down to this. Note that I'm not even creating a Vulkan instance:

TEST_F(PositiveTooling, Issue8439) {
    std::vector<uint32_t> spv = {
        0x07230203, 0x00010000, 0x0008000b, 0x00000019, 0x00000000, 0x00020011, 0x00000001, 0x0006000b, 
        0x00000001, 0x4c534c47, 0x6474732e, 0x3035342e, 0x00000000, 0x0003000e, 0x00000000, 0x00000001, 
        0x0005000f, 0x00000005, 0x00000004, 0x6e69616d, 0x00000000, 0x00060010, 0x00000004, 0x00000011, 
        0x00000008, 0x00000008, 0x00000001, 0x00030003, 0x00000002, 0x000001c2, 0x00040005, 0x00000004, 
        0x6e69616d, 0x00000000, 0x00040005, 0x00000009, 0x756c6176, 0x00000065, 0x00050005, 0x0000000d, 
        0x6d615375, 0x72656c70, 0x00000000, 0x00040047, 0x0000000d, 0x00000022, 0x00000000, 0x00040047, 
        0x0000000d, 0x00000021, 0x00000000, 0x00040047, 0x00000018, 0x0000000b, 0x00000019, 0x00020013, 
        0x00000002, 0x00030021, 0x00000003, 0x00000002, 0x00030016, 0x00000006, 0x00000020, 0x00040017, 
        0x00000007, 0x00000006, 0x00000004, 0x00040020, 0x00000008, 0x00000007, 0x00000007, 0x00090019, 
        0x0000000a, 0x00000006, 0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000001, 0x00000000, 
        0x0003001b, 0x0000000b, 0x0000000a, 0x00040020, 0x0000000c, 0x00000000, 0x0000000b, 0x0004003b, 
        0x0000000c, 0x0000000d, 0x00000000, 0x00040017, 0x0000000f, 0x00000006, 0x00000002, 0x0004002b, 
        0x00000006, 0x00000010, 0x3f000000, 0x0005002c, 0x0000000f, 0x00000011, 0x00000010, 0x00000010, 
        0x0004002b, 0x00000006, 0x00000012, 0x00000000, 0x00040015, 0x00000014, 0x00000020, 0x00000000, 
        0x00040017, 0x00000015, 0x00000014, 0x00000003, 0x0004002b, 0x00000014, 0x00000016, 0x00000008, 
        0x0004002b, 0x00000014, 0x00000017, 0x00000001, 0x0006002c, 0x00000015, 0x00000018, 0x00000016, 
        0x00000016, 0x00000017, 0x00050036, 0x00000002, 0x00000004, 0x00000000, 0x00000003, 0x000200f8, 
        0x00000005, 0x0004003b, 0x00000008, 0x00000009, 0x00000007, 0x0004003d, 0x0000000b, 0x0000000e, 
        0x0000000d, 0x00070058, 0x00000007, 0x00000013, 0x0000000e, 0x00000011, 0x00000002, 0x00000012, 
        0x0003003e, 0x00000009, 0x00000013, 0x000100fd, 0x00010038, 
    };

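    // Validate against the Vulkan 1.0 environment; no Vulkan instance or device is involved.
    spv_target_env spirv_environment = SPV_ENV_VULKAN_1_0;
    spv_context ctx = spvContextCreate(spirv_environment);
    spvtools::ValidatorOptions spirv_val_options;
    // The second member is the number of 32-bit words, not a byte count.
    spv_const_binary_t binary{spv.data(), spv.size()};
    spv_diagnostic diag = nullptr;

    // In the Debug build, the crash happens inside this call.
    const spv_result_t spv_valid = spvValidateWithOptions(ctx, spirv_val_options, &binary, &diag);
    ASSERT_TRUE(spv_valid == SPV_SUCCESS);

    spvDiagnosticDestroy(diag);
    spvContextDestroy(ctx);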
}
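
To sanity-check that the word array itself is a well-formed module before it reaches spvValidateWithOptions(), the five-word header can be decoded by hand. This is a sketch based on the header layout in the SPIR-V specification; SpirvHeader and ParseSpirvHeader are hypothetical names, not part of SPIRV-Tools.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// The five-word header that starts every SPIR-V module.
struct SpirvHeader {
    uint32_t magic;      // 0x07230203 for a valid module
    uint8_t major;       // bits 16..23 of the version word
    uint8_t minor;       // bits 8..15 of the version word
    uint32_t generator;  // tool that produced the module
    uint32_t bound;      // every result ID in the module is < bound
    uint32_t schema;     // reserved, normally 0
};

// Hypothetical helper: returns false if the buffer is shorter than the
// header or the magic number does not match; otherwise fills `out`.
bool ParseSpirvHeader(const uint32_t* words, size_t word_count, SpirvHeader* out) {
    if (words == nullptr || out == nullptr || word_count < 5) return false;
    if (words[0] != 0x07230203u) return false;
    out->magic = words[0];
    out->major = static_cast<uint8_t>((words[1] >> 16) & 0xffu);
    out->minor = static_cast<uint8_t>((words[1] >> 8) & 0xffu);
    out->generator = words[2];
    out->bound = words[3];
    out->schema = words[4];
    return true;
}
```

For the array above, the first five words decode to SPIR-V version 1.0 with an ID bound of 0x19, which is consistent with the highest ID used in the module (0x18).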

Weird thing is that if I add the same test to the SPIRV-Tools unit tests, it works fine! Same SPIRV-Tools commit, same CMake flags, same NDK.

mikes-lunarg • Aug 26 '24 15:08

> Weird thing is that if I add the same test to the SPIRV-Tools unit tests, it works fine! Same SPIRV-Tools commit, same CMake flags, same NDK.

Do the SPIRV-Tools unit tests also run on Android?

lunarpapillo • Aug 26 '24 18:08

By default, SPIRV-Tools tests do not run on Android. I was able to run them by commenting out these lines: https://github.com/KhronosGroup/SPIRV-Tools/blob/main/CMakeLists.txt#L315-L317 and then manually pushing and running the test executable using the adb shell.

mikes-lunarg • Aug 26 '24 18:08

Weird...

const spv_result_t spv_valid = spvValidateWithOptions(ctx, spirv_val_options, &binary, &diag);
ASSERT_TRUE(spv_valid == SPV_SUCCESS);

spvDiagnosticDestroy(diag);
spvContextDestroy(ctx);

I presume the crash occurs in spvValidateWithOptions(), as it seems to with the VVL tests, and the stack trace is otherwise similar; I presume you were also running the test in isolation via --gtest_filter, yes?

Since it works in SPIRV-Tools unit tests, do you have a hypothesis as to why it fails deterministically in VVL? I've got nothing...

lunarpapillo • Aug 26 '24 18:08

> I presume the crash occurs in spvValidateWithOptions(), as it seems to with the VVL tests, and the stack trace is otherwise similar; I presume you were also running the test in isolation via --gtest_filter, yes?

Yes and yes. And just like your initial write-up, this only affects the Debug build. Release builds make it past the spvValidateWithOptions() call and pass the assert.

> Since it works in SPIRV-Tools unit tests, do you have a hypothesis as to why it fails deterministically in VVL? I've got nothing...

No real hypothesis yet. The fact that the test code works in one build (SPIRV-Tools) and not the other (VVL) makes me suspect something about how we build/package libSPIRV-Tools.
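
One way to look for a build/packaging difference would be to print a small fingerprint of the toolchain settings from each binary and compare the strings. This is only a sketch under that assumption; BuildFingerprint is a hypothetical helper, and a matching fingerprint would not rule out other differences such as link order or compile flags that do not show up in these macros.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical diagnostic helper: summarizes compiler and standard-library
// settings as seen by the translation unit it is compiled into.
std::string BuildFingerprint() {
    std::ostringstream out;
#if defined(__clang_version__)
    out << "clang=" << __clang_version__ << ';';
#endif
#if defined(_LIBCPP_VERSION)
    out << "libcxx=" << _LIBCPP_VERSION << ';';
#endif
#if defined(_LIBCPP_ABI_VERSION)
    out << "libcxx_abi=" << _LIBCPP_ABI_VERSION << ';';
#endif
#if defined(NDEBUG)
    out << "NDEBUG=1;";
#else
    out << "NDEBUG=0;";
#endif
    // A size mismatch between two binaries would point at differing
    // container layouts.
    out << "sizeof_vector_u32=" << sizeof(std::vector<uint32_t>) << ';';
    return out.str();
}
```

Compiling the same helper into the SPIRV-Tools test executable and into the VVL test library, then logging both results, would show whether the two see the same libc++ version and assertion settings.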

mikes-lunarg • Aug 26 '24 19:08

Similar issue: https://github.com/KhronosGroup/glslang/issues/3534

That reporter traced it back to a specific constructor for std::vector and patched around it by constructing the vector using a different method.
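
As an illustration of that kind of workaround, here are two ways to build the same std::vector. Which constructor was involved in the glslang issue is not stated here, so both functions are hypothetical examples rather than the actual patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds the vector with the iterator-range constructor.
std::vector<uint32_t> CopyWithRangeCtor(const uint32_t* words, size_t count) {
    return std::vector<uint32_t>(words, words + count);
}

// Alternative that avoids the range constructor: default-construct,
// reserve the capacity once, then append element by element.
std::vector<uint32_t> CopyWithPushBack(const uint32_t* words, size_t count) {
    std::vector<uint32_t> out;
    out.reserve(count);
    for (size_t i = 0; i < count; ++i) {
        out.push_back(words[i]);
    }
    return out;
}
```

Both produce identical contents, so swapping one for the other changes only which library code path runs. That makes it usable as a workaround, but it would not explain the underlying cause.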

mikes-lunarg • Aug 28 '24 17:08