
Question about data type and layout conversion between GPU and CPU

Open alstjd025 opened this issue 8 months ago • 2 comments

detail | detailed description

Hi.

I'm currently working with NCNN and I have some questions about data type and layout conversion between GPU and CPU.

On page 7 of this slide deck, "ncnn vulkan", it looks like ncnn uses a [c/4, h, w] data layout for the CPU on ARMv7 architectures and also for the GPU (mobile).

In general mobile environments, does NCNN use the same tensor layout between the CPU and GPU contexts when performing inference across both processors?

If so, does this mean there's no additional overhead for layout transformation (e.g., latency or memory cost) when switching between CPU and GPU execution?
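For reference, my understanding of the [c/4, h, w] packing (elempack = 4 in ncnn terms) is that four consecutive channels are interleaved at each spatial position. A stdlib-only sketch of that repacking, with a hypothetical helper name (not ncnn code):

```cpp
#include <cstddef>
#include <vector>

// Pack a planar [c, h, w] fp32 tensor into [c/4, h, w] with groups of
// 4 channels interleaved per element (elempack = 4 in ncnn terms).
// Assumes c is divisible by 4 for brevity; real code pads the tail group.
std::vector<float> pack_c4(const std::vector<float>& src, int c, int h, int w)
{
    std::vector<float> dst(src.size());
    const int plane = h * w;
    for (int q = 0; q < c / 4; q++)        // packed channel group
        for (int i = 0; i < plane; i++)    // spatial position
            for (int k = 0; k < 4; k++)    // lane within the group
                dst[(q * plane + i) * 4 + k] = src[(q * 4 + k) * plane + i];
    return dst;
}
```

One SIMD (or shader vec4) load then fetches four channel values at the same spatial position.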

Thank you.

alstjd025 avatar May 26 '25 03:05 alstjd025

Data types and layouts depend on the properties of the cpu and gpu, as well as the user's option settings, so they are usually different between the cpu and gpu.

For example, armv7 uses [c/4,h,w] fp32 for channel counts divisible by 4, while the gpu uses [c/4,h,w] with fp16 storage, because the gpu can pack/unpack fp16 quickly. If the user disables use_fp16_packed in the option settings, so that the gpu uses the same data type and layout as the cpu, then yes, the two are exactly the same.

There will still be copies between the cpu and gpu. On a unified memory architecture (usually integrated graphics) this is optimized down to a single memcpy; on discrete graphics there are 2 copies, host ➡️ staging buffer ➡️ device.
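To illustrate the storage saving behind the fp16 path, here is a simplified stdlib-only fp32 → fp16 bit conversion (truncating rounding, no NaN/denormal handling; this is not ncnn's code, real hardware does round-to-nearest in one instruction):

```cpp
#include <cstdint>
#include <cstring>

// Convert an fp32 value to IEEE fp16 bits (round-toward-zero; inf/NaN and
// denormals are not handled). Shows why fp16 storage halves bandwidth:
// the same value fits in 16 bits instead of 32.
uint16_t fp32_to_fp16(float f)
{
    uint32_t x;
    std::memcpy(&x, &f, sizeof(x));
    uint16_t sign = (x >> 16) & 0x8000u;
    int32_t exp = (int32_t)((x >> 23) & 0xFF) - 127 + 15; // rebias 8->5 bit exponent
    uint32_t mant = x & 0x7FFFFFu;
    if (exp <= 0) return sign;                            // flush tiny values to zero
    if (exp >= 31) return sign | 0x7C00u;                 // overflow to infinity
    return sign | (uint16_t)(exp << 10) | (uint16_t)(mant >> 13);
}
```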

ncnn will not use the cpu data directly in the gpu pipeline, because gpu storage has stricter alignment requirements and a different life cycle than host memory.

If the caller can manage the life cycle of the GPU memory itself, VkMat provides mapped() / mapped_ptr() to access the GPU memory directly on integrated graphics, or the staging buffer memory on discrete graphics. This can avoid the memcpy entirely.
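A pseudocode-style sketch of that path (not compiled here; the allocator choice, dimensions, and elemsize are placeholders):

```cpp
// Sketch only: fill a VkMat through its host mapping instead of uploading
// from a CPU-side Mat. On integrated graphics this writes device memory
// directly; on discrete graphics it writes the staging buffer.
ncnn::VkAllocator* allocator = vkdev->acquire_blob_allocator();

ncnn::VkMat gpu_input;
gpu_input.create(224, 224, 3, 4u /* elemsize */, 1 /* elempack */, allocator);

float* ptr = (float*)gpu_input.mapped_ptr();
// ... write the input through ptr; the caller must keep gpu_input alive
// until the GPU command that consumes it has finished ...
```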

nihui avatar May 26 '25 07:05 nihui

@nihui Thank you for the kind and fast response 👍.

I'm carefully looking into the sources and I found these at src/allocator.cpp.

// line 740
buffer_memory_type_index = vkdev->find_memory_index(
    memoryRequirements.memoryTypeBits,
    VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT,  // required
    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT,  // preferred
    0);                                   // preferred-not

...

// line 1896
// setup memory type
if (buffer_memory_type_index == (uint32_t)-1)
{
    buffer_memory_type_index = vkdev->find_memory_index(
        memoryRequirements.memoryTypeBits,
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT,  // required
        VK_MEMORY_PROPERTY_HOST_CACHED_BIT,                                          // preferred
        VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);                                        // preferred-not
}

As I understand from your answer, even if the CPU and GPU use exactly the same type and layout (e.g., [c/4,h,w]), one copy still occurs between them, even on a unified memory architecture.

To the best of my knowledge, certain Vulkan memory property flags can be used to allocate a unified buffer that is visible and accessible/modifiable by both the CPU and GPU on unified memory architectures, such as common mobile SoCs, and I found some of these flags in the ncnn code.

At line 740, the Vulkan buffer allocation logic seems to create a "GPU-dedicated buffer" that is "visible" to the host but not "modifiable" by it, and this looks like the buffer for intermediates (the output of one layer/operator and the input of the next). However, at line 1896 it allocates a buffer for weights (I'm not entirely sure, but the class is named VkWeightStagingAllocator) as a "CPU-GPU unified buffer" that is both accessible and modifiable by the CPU and the GPU.
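If the last three arguments of find_memory_index are (required, preferred, preferred_not) flag sets, which is how I read these two call sites, then a stdlib-only model of such a selector would look like this (hypothetical code, not ncnn's actual implementation):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of a find_memory_index-style selector: among the
// memory types allowed by `type_bits`, insist on `required` flags, and
// prefer a type that has all `preferred` flags set and no `preferred_not`
// flags; otherwise fall back to the first type that merely meets `required`.
int find_memory_index(const std::vector<uint32_t>& type_flags, uint32_t type_bits,
                      uint32_t required, uint32_t preferred, uint32_t preferred_not)
{
    int fallback = -1;
    for (int i = 0; i < (int)type_flags.size(); i++)
    {
        if (!(type_bits & (1u << i)))
            continue; // type not allowed for this resource
        uint32_t flags = type_flags[i];
        if ((flags & required) != required)
            continue; // hard requirement not met
        if ((flags & preferred) == preferred && (flags & preferred_not) == 0)
            return i; // best match
        if (fallback == -1)
            fallback = i; // meets requirements only
    }
    return fallback;
}
```

With flag values standing in for DEVICE_LOCAL=1, HOST_VISIBLE=2, HOST_COHERENT=4, HOST_CACHED=8, the line-740-style call picks a device-local type that is also host-visible when one exists, while the line-1896-style call demands host-visible+coherent memory and steers away from device-local types.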

OK, to avoid some confusion, let me summarize my extra questions here.

  1. Why doesn't NCNN allocate unified buffers for intermediate tensors? Given that the intermediate activations (i.e., outputs of layers or operators) are transferred between CPU and GPU during inference, wouldn't it be more reasonable to use unified buffers for them instead of just for weights?

  2. You mentioned earlier that even with a unified memory architecture, one copy still occurs due to stringent data layout alignment requirements. If that is the main reason, could you point out where the layout transition actually happens in the code, and briefly explain how it is implemented?

  3. Apart from the [c/4, h, w] layout, are there any other specific data layouts that are implicitly required or enforced by mobile GPUs, particularly for taking advantage of the texture cache or image-based memory access?

Sorry for the long question — I really appreciate your help and any insights you can share. 🙏

alstjd025 avatar May 27 '25 04:05 alstjd025