[REQ] Support explicit struct memory alignment via decorator

Open WenchaoHuang opened this issue 7 months ago • 3 comments

Description

It would be great to support a decorator like @struct_align(alignment: int) — or alternatively, accept an 'alignment' parameter in struct definitions — to enforce explicit memory alignment for user-defined structs.

This feature would allow developers to precisely control memory layout, aligning with native CUDA/C++ behaviors such as __align__(n) or alignas(n).

Example:

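# Current syntax: no explicit alignment
@wp.struct
class IntFloat:
    index: int
    value: float

# Proposed option 1: a dedicated decorator
@wp.struct_align(8)
class IntFloat:
    index: int
    value: float

# Proposed option 2: an 'alignment' parameter on the existing decorator
@wp.struct(alignment=8)
class IntFloat:
    index: int
    value: float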
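
A decorator of this shape could be written in plain Python roughly as follows. This is only a sketch and not Warp's API: `struct_align` and the `_alignment_` attribute are hypothetical names, and the code merely records the requested alignment on the class after checking that it is a power of two.

```python
def struct_align(alignment: int):
    """Hypothetical decorator factory that records a requested alignment on a class."""
    # C++/CUDA alignments must be positive powers of two.
    if alignment <= 0 or (alignment & (alignment - 1)) != 0:
        raise ValueError(f"alignment must be a positive power of two, got {alignment}")

    def decorate(cls):
        cls._alignment_ = alignment  # hypothetical attribute name, not a Warp attribute
        return cls

    return decorate


@struct_align(8)
class IntFloat:
    index: int
    value: float


print(IntFloat._alignment_)  # 8
```

Rejecting invalid values at decoration time keeps the error close to the struct definition instead of surfacing later during code generation.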

Context

Explicit alignment control is often necessary when interoperating with native code, optimizing memory access patterns, or ensuring correct data layout for GPU memory operations. Although such an interface might be less Pythonic and not very friendly for typical Python users, it can bring performance benefits in certain specialized scenarios.

WenchaoHuang, Jul 08 '25 16:07

@WenchaoHuang can you provide specific examples that either wouldn't work or would be slower without this feature?

Note that wp.array can take a strides argument to ensure proper spacing between elements. That said, it doesn't ensure that each element itself is aligned. I assume the expectation is that arrays created with a dtype of a struct with an alignment property would adopt that alignment?
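
The difference between element spacing and element alignment can be shown with plain address arithmetic. The sketch below does not call Warp; the base addresses are made up and `element_addresses` and `all_aligned` are hypothetical helpers.

```python
def element_addresses(base, stride, count):
    """Byte address of each element for a given base address and stride."""
    return [base + i * stride for i in range(count)]


def all_aligned(addresses, alignment):
    """True if every address is a multiple of the alignment."""
    return all(addr % alignment == 0 for addr in addresses)


# Same 16-byte stride, different base addresses.
from_aligned_base = element_addresses(base=256, stride=16, count=4)
from_offset_base = element_addresses(base=260, stride=16, count=4)

print(all_aligned(from_aligned_base, 16))  # True
print(all_aligned(from_offset_base, 16))   # False
```

A stride that is a multiple of the alignment keeps elements aligned only if the first element is aligned too.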

One thing that might cause a limitation is that ctypes.Structure does not support alignment (yet). But I'd like to better understand the scenarios where this feature would come in useful.
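
For reference, the natural layout that ctypes gives the `IntFloat` example can be inspected with the standard library. The sketch assumes `index` maps to a 32-bit integer and `value` to a 32-bit float; that mapping is an assumption made for illustration.

```python
import ctypes


class IntFloat(ctypes.Structure):
    # Assumed mapping: int -> 32-bit integer, float -> 32-bit float.
    _fields_ = [("index", ctypes.c_int32), ("value", ctypes.c_float)]


print(ctypes.sizeof(IntFloat))     # 8
print(ctypes.alignment(IntFloat))  # 4
```

The struct occupies 8 bytes but only requires 4-byte alignment, which is the gap an explicit alignment of 8 would close.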

c0d1f1ed, Oct 10 '25 22:10

In C++, the alignas specifier (or __align__ in CUDA) on a struct mainly serves as an optimization hint to NVCC and does not affect device memory allocation. In principle, it is the developer's responsibility to ensure that the device memory address meets the struct's alignment requirement. However, according to the CUDA Programming Guide, addresses returned by cudaMalloc are always aligned to at least 256 bytes. Using the address returned by cudaMalloc directly therefore satisfies the alignment requirement of most structs, unless a custom memory layout is used (which is not a consideration for the current Warp implementation).

When a struct with a size of 2/4/8/16 bytes is annotated with __align__(2/4/8/16), NVCC will use vectorized load instructions instead of individual element-by-element loading when accessing such structures. This optimization can reduce the number of instructions and potentially improve cache hit rates, particularly during random access patterns (though this effect becomes less significant on newer GPU architectures).
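
As a rough illustration of this point, the toy model below estimates how many load instructions one struct read needs when the widest usable load is limited by the struct's alignment. `loads_per_struct` is a hypothetical helper, and the model is a simplification: it assumes power-of-two sizes and alignments and a 16-byte maximum load width, while actual instruction selection is up to NVCC.

```python
MAX_LOAD_BYTES = 16  # widest single load assumed by this toy model


def loads_per_struct(size, alignment):
    """Toy estimate of loads needed to read one struct of `size` bytes."""
    # The widest load cannot exceed the guaranteed alignment or the struct size.
    width = min(alignment, size, MAX_LOAD_BYTES)
    return size // width


print(loads_per_struct(size=16, alignment=4))   # 4
print(loads_per_struct(size=16, alignment=16))  # 1
print(loads_per_struct(size=8, alignment=4))    # 2
print(loads_per_struct(size=8, alignment=8))    # 1
```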

Below is an example I wrote with Warp to demonstrate the performance difference between aligned and unaligned memory accesses. You can profile these kernels directly with either Nsight Systems or Nsight Compute.

import numpy as np
import warp as wp



@wp.kernel
def random_access_without_align_kernel(
    data_in: wp.array(dtype=wp.vec4),
    indices: wp.array(dtype = int),
    # outputs
    data_out: wp.array(dtype=wp.vec4),
):
    tid = wp.tid()

    data_out[tid] = data_in[indices[tid]]



@wp.func_native("""
	float4 val;
	asm volatile(
		"ld.global.v4.f32 {%0, %1, %2, %3}, [%4];"
		: "=f"(val.x), "=f"(val.y), "=f"(val.z), "=f"(val.w)
		: "l"(data + i)
	);
	return reinterpret_cast<wp::vec4&>(val);
""")
def vec4_load(data: wp.array(dtype=wp.vec4), i: int) -> wp.vec4: pass



@wp.kernel
def random_access_with_align_kernel(
    data_in: wp.array(dtype = wp.vec4),
    indices: wp.array(dtype = int),
    # outputs
    data_out: wp.array(dtype = wp.vec4),
):
    tid = wp.tid()

    data_out[tid] = vec4_load(data_in, indices[tid])



count = 1000000
# Shuffled indices produce a random access pattern
random_indices = np.arange(count, dtype=np.int32)
np.random.shuffle(random_indices)

wp.init()
verts_in = wp.zeros(count, dtype=wp.vec4)
verts_out = wp.zeros(count, dtype=wp.vec4)
indices = wp.array(random_indices, dtype=int)


for _ in range(10):
    wp.launch(
        random_access_without_align_kernel,
        dim=[count],
        inputs=[verts_in, indices],
        outputs=[verts_out],
        block_dim = 128,
    )

for _ in range(10):
    wp.launch(
        random_access_with_align_kernel,
        dim=[count],
        inputs=[verts_in, indices],
        outputs=[verts_out],
        block_dim = 128,
    )

wp.synchronize()

I use wp.vec4 as the example here for simplicity (otherwise it would be difficult to use within func_native), since it is equivalent to the following Warp struct definition.

@wp.struct
class MyVec4:
    x: float
    y: float
    z: float
    w: float

Since this feature is currently unimplemented, I had to manually implement equivalent PTX code using func_native (if the feature were implemented, NVCC would automatically compile it into an aligned-access version). On my RTX 4090, aligned access is 10-15% faster than unaligned access.

WenchaoHuang, Oct 13 '25 07:10

Thanks for providing these additional details!

It's quite straightforward to implement an alignment parameter for @wp.struct that translates into an alignas() specifier in the CUDA code. I just want to be careful of unintended consequences of exposing this directly. We want to make it hard to write incorrect code, or at least make it easy to detect when something goes wrong and assist the developer in resolving it.
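
To make the idea concrete, the sketch below shows one way an alignment value could be validated and turned into an alignas() specifier in generated C++ source. It is not Warp's code generator: `emit_struct` and its field-tuple format are hypothetical, and the checks only illustrate the kind of guard rails mentioned above.

```python
def emit_struct(name, fields, alignment=None):
    """Hypothetical generator returning C++ source for a struct.

    `fields` is a list of (field_name, c_type, size_in_bytes) tuples.
    """
    natural = max(size for _, _, size in fields)
    if alignment is not None:
        if alignment <= 0 or (alignment & (alignment - 1)) != 0:
            raise ValueError("alignment must be a positive power of two")
        # alignas() may not request less than the natural alignment.
        if alignment < natural:
            raise ValueError(f"alignment {alignment} is weaker than natural alignment {natural}")
    spec = f"alignas({alignment}) " if alignment is not None else ""
    lines = [f"struct {spec}{name}", "{"]
    lines += [f"    {c_type} {field};" for field, c_type, _ in fields]
    lines.append("};")
    return "\n".join(lines)


src = emit_struct("IntFloat", [("index", "int", 4), ("value", "float", 4)], alignment=8)
print(src)
```

Failing early with a clear message is one way to make an incorrect alignment hard to write and easy to diagnose.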

c0d1f1ed, Oct 17 '25 19:10