
YOLO-NAS model diverges with Vulkan backend

Open lsauerr opened this issue 2 months ago • 11 comments

🐛 Describe the bug

Hello,

I am trying to run a YOLO-NAS model on a GPU using a Vulkan backend.

I can lower and export the model and run it with the executor_runner; however, the first output tensor, which represents the box coordinates, diverges. This behavior is not consistent: executing the same command line may produce different outputs, even with the same input file. The command I am using is the following:

./executor_runner --model_path=YOLO_NAS_S_vulkan.pte --inputs=dummy_input.bin
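
For reference, a raw input file like dummy_input.bin can be produced along these lines (a minimal sketch; it assumes executor_runner reads raw tensor bytes and that the model input is the usual 1x3x640x640 float32 for YOLO-NAS-S, so adjust as needed):

import numpy as np

# Write raw little-endian float32 bytes with no header; the shape must match
# the exported model's input.
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)
dummy.tofile("dummy_input.bin")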

I put together a minimal reproduction using the export.py available in executorch/examples/vulkan and am attaching the diff so you can reproduce it. To run it:

 python export.py --model=YOLO_NAS_S

I believe that @SS-JIA is the most suitable person to provide support for this.

export_diff.txt

Many thanks for your support.

Versions

Collecting environment information...
PyTorch version: 2.9.0+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 22.0.0git
CMake version: version 3.31.6
Libc version: glibc-2.35

Python version: 3.10.12 (main, May 27 2025, 17:12:29) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-60-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.5.119
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA RTX 2000 Ada Generation
Nvidia driver version: 570.133.07
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i9-14900
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 1
CPU max MHz: 5800.0000
CPU min MHz: 800.0000
BogoMIPS: 3993.60
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 896 KiB (24 instances)
L1i cache: 1.3 MiB (24 instances)
L2 cache: 32 MiB (12 instances)
L3 cache: 36 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Mitigation; Clear Register File
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] executorch==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] onnx==1.15.0
[pip3] onnxruntime==1.15.0
[pip3] onnxsim==0.4.36
[pip3] pytorch-lightning==2.5.5
[pip3] pytorch_tokenizers==1.0.1
[pip3] torch==2.9.0
[pip3] torch-complex==0.4.4
[pip3] torchao==0.14.0+git01849b2b1
[pip3] torchaudio==2.9.0
[pip3] torchmetrics==0.8.0
[pip3] torchvision==0.24.0
[pip3] triton==3.5.0
[conda] Could not collect

cc @SS-JIA @manuelcandales @digantdesai @cbilgin

lsauerr avatar Nov 10 '25 15:11 lsauerr

@lsauerr thanks for reporting!

Could you provide one or both:

  1. the .pte file you are testing with
  2. an export script that I can use to produce the .pte file you are testing with

This will help me reproduce the issue and investigate further. Thanks in advance!

SS-JIA avatar Nov 10 '25 15:11 SS-JIA

Thanks for the quick reply. I only extended what you did for the Vulkan examples; I put the diff between my version and yours here. I can't send the *.pte file; I am getting a message that this file type is not supported.

To run it:

 python export.py --model=YOLO_NAS_S

Many thanks for the support!

lsauerr avatar Nov 10 '25 16:11 lsauerr

@lsauerr perfect, thanks. I will try to reproduce and will update when I have more news 👍

SS-JIA avatar Nov 10 '25 19:11 SS-JIA

@lsauerr I was able to reproduce an issue where the model was producing incorrect output, and found that it was an issue with the split_with_sizes operator. I have created a fix here: https://github.com/pytorch/executorch/pull/15793
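
For anyone who wants to sanity check that operator path in isolation, here is a minimal sketch (the module, tensor shape, and split sizes are illustrative, not taken from YOLO-NAS) that lowers a split_with_sizes call through the Vulkan partitioner:

import torch

from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
from executorch.exir import to_edge_transform_and_lower


class SplitModule(torch.nn.Module):
    def forward(self, x):
        # torch.split with a list of sizes lowers to aten.split_with_sizes
        a, b, c = torch.split(x, [16, 32, 16], dim=1)
        return a + 1, b * 2, c - 1


example_inputs = (torch.randn(1, 64, 20, 20),)
program = to_edge_transform_and_lower(
    torch.export.export(SplitModule(), example_inputs),
    partitioner=[VulkanPartitioner()],
).to_executorch()

with open("split_vulkan.pte", "wb") as f:
    f.write(program.buffer)

The resulting split_vulkan.pte can be run with executor_runner the same way as the full model.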

If you are able, it would be great if you could try pulling the fix and checking to see if the issue persists. Thanks in advance!

P.S. For an easier repro, you can also use this PR https://github.com/pytorch/executorch/pull/15795 where I add YOLO_NAS_S as an option in export.py

git checkout gh/SS-JIA/371/head

Pulling that branch will include the fix as well. Once the branch is checked out, you can test via

# Install
./install_executorch.sh -e

# Run export script and test
MODEL_NAME=YOLO_NAS_S && \
python -m examples.vulkan.export --model_name=$MODEL_NAME -o . --test

SS-JIA avatar Nov 12 '25 23:11 SS-JIA

Many thanks for that!

For anyone following this issue in the future, use the following to install (as pointed out here): CMAKE_ARGS="-DEXECUTORCH_BUILD_VULKAN=ON" ./install_executorch.sh -e

I went through the process on your branch and the test passed. However, when executing the model with the executor_runner, the values still diverge (and thus produce non-replicable results).

I used the following to build the executor_runner, inside the branch you told me to check out:

cmake . \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_RUNNER_UTIL=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_TESTS=ON \
    -DEXECUTORCH_BUILD_VULKAN=ON \
    -DGLSLC_PATH=$(which glslc) \
    -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \
    -Bcmake-out
cmake --build cmake-out -j16 --target executor_runner

To execute it:

cmake-out/executor_runner --model_path=YOLO_NAS_S_vulkan.pte

For what it is worth:

- The behavior has been observed on x86_64 as well as on aarch64.
- I transformed the model using XnnpackPartitioner() instead of VulkanPartitioner() and executed it with executor_runner; the results did not diverge and were replicable across different runs.
- I attached the produced output to show the divergence. You can notice it at the end of the first tensor, with values around e+30.

output.txt
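
To quantify the run-to-run difference, I am thinking of a comparison along these lines (a rough sketch: the XNNPACK file name, the 1x3x640x640 input shape, and the availability of the Python runtime bindings built with Vulkan enabled are all assumptions):

import torch
from executorch.runtime import Runtime

runtime = Runtime.get()
x = torch.rand(1, 3, 640, 640)  # assumed input shape


def run(pte_path):
    # Load the program and execute the forward method on the same input tensor.
    method = runtime.load_program(pte_path).load_method("forward")
    return [out.clone() for out in method.execute([x])]


vk_a = run("YOLO_NAS_S_vulkan.pte")
vk_b = run("YOLO_NAS_S_vulkan.pte")   # second run with the same input
xnn = run("YOLO_NAS_S_xnnpack.pte")   # hypothetical XNNPACK export as reference

print("vulkan run-to-run max abs diff :", (vk_a[0] - vk_b[0]).abs().max().item())
print("vulkan vs xnnpack max abs diff :", (vk_a[0] - xnn[0]).abs().max().item())
print("non-finite values in vulkan out:", (~torch.isfinite(vk_a[0])).sum().item())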


Any ideas on the cause? Many thanks for investigating it.

lsauerr avatar Nov 13 '25 11:11 lsauerr

@lsauerr Unfortunately I'm not able to observe the divergence that you are seeing on my development machine. I also tried on the main branch and could not see the divergence (although the outputs did appear to be incorrect), so I think the divergence you are observing comes from a separate issue entirely, one that is not addressed by my fix.

Here are some things you can try. First, if you are exporting with the -fp16 option, try to turn that off. Next, try exporting without memory planning. You can do this by running:

MODEL_NAME=YOLO_NAS_S && \
python -m examples.vulkan.export --model_name=$MODEL_NAME -o . --skip_memory_planning 

Finally, I have posted my full repro instructions below - it may be helpful to try it out and compare with my uploaded results. If you have an Android device on hand, it may be helpful to test on that device as well to see if the issue may be something particular to the GPU you are testing on.

For the record, I am testing on a machine equipped with an NVIDIA A100, here is the output of nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA PG509-210               On  |   00000000:04:00.0 Off |                    0 |
| N/A   29C    P0             48W /  330W |       6MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Here are my full repro instructions (outputs attached in a separate comment) for your reference. Note that I have updated my branch to make testing a bit easier, in particular adding the ability to save input data to export.py.

# Checkout branch
git checkout gh/SS-JIA/371/head

# Install w/ Vulkan bindings
CMAKE_ARGS="-DEXECUTORCH_BUILD_VULKAN=ON" ./install_executorch.sh -e

# Export and save inputs
MODEL_NAME=YOLO_NAS_S && \                          
python -m examples.vulkan.export --model_name=$MODEL_NAME -o . --test --save_inputs

# Build for host
cmake . \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON \
    -DEXECUTORCH_BUILD_VULKAN=ON \
    -DEXECUTORCH_BUILD_TESTS=OFF \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_KERNELS_LLM=ON \
    -DEXECUTORCH_BUILD_KERNELS_LLM_AOT=ON \
    -DEXECUTORCH_BUILD_DEVTOOLS=ON \
    -DEXECUTORCH_ENABLE_EVENT_TRACER=ON \
    -Bcmake-out && \
cmake --build cmake-out -j64 --target install

# Run executor runner on host
cmake-out/executor_runner --model_path=YOLO_NAS_S_vulkan.pte --inputs=input0.bin

# Build for Android
cmake . \
    -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \
    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \
    --preset "android-arm64-v8a" \
    -DANDROID_PLATFORM=android-28 \
    -DPYTHON_EXECUTABLE=python \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
    -DEXECUTORCH_PAL_DEFAULT=posix \
    -DEXECUTORCH_BUILD_VULKAN=ON \
    -DEXECUTORCH_BUILD_TESTS=OFF \
    -DEXECUTORCH_BUILD_EXTENSION_EVALUE_UTIL=ON \
    -DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON \
    -DEXECUTORCH_ENABLE_EVENT_TRACER=ON \
    -Bcmake-out-android-so && \
cmake --build cmake-out-android-so -j16 --target install --config Release

# Push artifacts
adb shell mkdir -p /data/local/tmp/etvk/models/yolonas && \
adb push YOLO_NAS_S_vulkan.pte /data/local/tmp/etvk/models/yolonas && \
adb push input0.bin /data/local/tmp/etvk/models/yolonas && \
adb push cmake-out-android-so/executor_runner /data/local/tmp/etvk

# Run on Android
PREFIX_PATH=/data/local/tmp/etvk/models/yolonas && \
adb shell /data/local/tmp/etvk/executor_runner \
  --model_path=$PREFIX_PATH/YOLO_NAS_S_vulkan.pte --inputs=$PREFIX_PATH/input0.bin

SS-JIA avatar Nov 13 '25 16:11 SS-JIA

Attached are the outputs I'm observing.

Output when testing on the main branch on my development machine: yolonas_main.txt

Output when testing on the gh/SS-JIA/371/head branch on my development machine: yolonas_with_fix.txt

Output when testing on the gh/SS-JIA/371/head branch, on an Android device: yolonas_android_with_fix.txt

SS-JIA avatar Nov 13 '25 16:11 SS-JIA

Re-uploading with stderr output as well so there's more context in the logs:

yolonas_with_fix.txt

yolonas_android_with_fix.txt

SS-JIA avatar Nov 13 '25 16:11 SS-JIA

Thank you very much for the detailed info and for the great support you're providing.

I tried building it on another workstation but still faced a similar divergence, although less frequently. Can you please share which version of the LunarG SDK you are using when building the executor_runner? Especially the glslc version. I am currently using the most recently released version: 1.4.328.1.

lsauerr avatar Nov 14 '25 12:11 lsauerr

Certainly. I'm using Vulkan SDK version 1.4.321.1 at the moment. My glslc version is

shaderc v2023.8 v2025.3
spirv-tools v2025.3 v2022.4-833-g33e02568
glslang 11.1.0-1253-gefd24d75

Target: SPIR-V 1.0

FWIW, I don't think the Vulkan SDK version would have much of an impact on the behaviour you are observing.

Btw, in the original post you mentioned that the GPU on your machine is an NVIDIA RTX 2000 Ada Generation. Is this the case with the other machines you are testing on as well?

SS-JIA avatar Nov 14 '25 16:11 SS-JIA

Thanks for that.

I used the same Vulkan SDK and you were right, nothing changed. The other machine I am testing with has an NVIDIA GeForce RTX 3080 GPU. I don't have an Android device to test on as you suggested, but I have observed the behavior on an embedded GPU as well, which is not NVIDIA.

FWIW, I lowered the YOLO-NAS application to execute on the embedded GPU with an alternative flow using ONNX → IREE and it did not diverge, but the performance was very poor (one order of magnitude worse in terms of inference time compared to ExecuTorch), so I decided to try ExecuTorch again after the new stable release.

lsauerr avatar Nov 15 '25 10:11 lsauerr

Sorry to hear that you are still experiencing the same issues. Note that I will be taking some vacation for the next two weeks, but once I am back I will be able to test on a laptop that has an NVIDIA RTX 3070 and another laptop that has an NVIDIA RTX 4080. Hopefully this will be closer to your setup and I will be able to replicate and investigate further.

Would you mind sharing what you see running nvidia-smi? I will try to match your driver versions if possible.

Also, just out of curiosity, would you be able to share which embedded GPU you tested on?

SS-JIA avatar Nov 17 '25 10:11 SS-JIA

Thank you very much for your efforts. Here are the two workstations I tested it with:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 2000 Ada Gene...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   25C    P8              7W /   70W |     112MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2303      G   /usr/lib/xorg/Xorg                       64MiB |
|    0   N/A  N/A            2535      G   /usr/bin/gnome-shell                     15MiB |
+-----------------------------------------------------------------------------------------+

And:

+-----------------------------------------------------------------------------------------+ 
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     | 
|-----------------------------------------+------------------------+----------------------+ 
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC | 
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. | 
|                                         |                        |               MIG M. | 
|=========================================+========================+======================| 
|   0  NVIDIA GeForce RTX 3080 ...    On  |   00000000:01:00.0 Off |                  N/A | 
| N/A   49C    P0             26W /   80W |       1MiB /  16384MiB |      4%      Default | 
|                                         |                        |                  N/A | 
+-----------------------------------------+------------------------+----------------------+ 
                                                                                            
+-----------------------------------------------------------------------------------------+ 
| Processes:                                                                              | 
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory | 
|        ID   ID                                                               Usage      | 
|=========================================================================================| 
|  No running processes found                                                             | 
+-----------------------------------------------------------------------------------------+ 

The embedded GPU I am testing with is an Arm GPU from the Mali family, Valhall architecture.

I noticed that a model.etdump is generated when executing the application, and I saw that it can be read with the Inspector API. Do you think I could find something in there that would provide some insight? Any advice on debugging this further?
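
In case it is useful, this is roughly how I was planning to read it (a minimal sketch based on the devtools docs; the etdump file name is assumed to be whatever executor_runner wrote out):

from executorch.devtools import Inspector

# Inspect the per-operator / delegate events recorded in the ETDump produced by
# an executor_runner built with -DEXECUTORCH_ENABLE_EVENT_TRACER=ON.
inspector = Inspector(etdump_path="model.etdump")
inspector.print_data_tabular()

My understanding is that, by default, the ETDump mostly carries profiling events rather than intermediate tensor values, so it might help localize a misbehaving op but not directly show the diverging numbers.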

Enjoy your vacation!

lsauerr avatar Nov 17 '25 10:11 lsauerr

Hi @SS-JIA, I hope you had a good time during your vacation!

Did you have the time to try the same driver version to verify if the divergence is replicable?

As usual, thanks for your assistance.

lsauerr avatar Dec 17 '25 08:12 lsauerr

@lsauerr thanks for the reminder - I'm still in the process of setting up my NVIDIA-GPU-equipped laptop for development; I will follow up tomorrow with an update!

SS-JIA avatar Dec 18 '25 17:12 SS-JIA