Po Yen Chen

Results 13 issues of Po Yen Chen

When adding a new type of device operator instance, we also have to add the corresponding `add_device_xxxx_instances()` declarations to the header. This is error-prone and time-consuming. ```c++ // file: library/include/ck/library/tensor_operation_instance/gpu/gemm.hpp namespace ck...

`ck::Array` and `std::array` behave the same. The only difference between the two types is that the former has a templated assignment operator. I think `ck::Array` can be used in most use...
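To make the "templated assignment operator" difference concrete, here is a minimal sketch. `ArrayWithTemplatedAssign` and `sum_after_assign()` are hypothetical names invented for illustration; this is not the real `ck::Array`, only an assumption about what such an operator could look like on top of `std::array`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in, NOT the real ck::Array: a thin wrapper around
// std::array that adds a templated assignment operator.
template <typename T, std::size_t N>
struct ArrayWithTemplatedAssign
{
    std::array<T, N> data{};

    T& operator[](std::size_t i) { return data[i]; }
    const T& operator[](std::size_t i) const { return data[i]; }

    // Accepts any source that can be indexed with operator[] and whose
    // elements convert to T. Plain std::array has no such operator.
    template <typename Source>
    ArrayWithTemplatedAssign& operator=(const Source& src)
    {
        for(std::size_t i = 0; i < N; ++i)
        {
            data[i] = static_cast<T>(src[i]);
        }
        return *this;
    }
};

// Assigns a std::array<long, 3> to an int-based wrapper, which would not
// compile for std::array<int, 3> = std::array<long, 3>.
inline int sum_after_assign()
{
    std::array<long, 3> src{1, 2, 3};
    ArrayWithTemplatedAssign<int, 3> dst;
    dst = src;
    assert(dst[0] == 1);
    return dst[0] + dst[1] + dst[2];
}
```

Apart from that extra operator, the wrapper only forwards to `std::array`, which is the sense in which the two types would otherwise behave alike.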

There is a lot of duplicated code in the implementations, such as the `HostTensorDescriptor` creation logic. ```c++ auto f_host_tensor_descriptor1d = [](std::size_t len, std::size_t stride) { return HostTensorDescriptor({len}, {stride}); }; auto f_host_tensor_descriptor2d = [](std::size_t...
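One possible way to remove that kind of duplication is a single shared factory function in place of per-file lambdas. The sketch below is an assumption, not the change proposed in the issue: `DescriptorSketch`, `make_descriptor()` and `check_make_descriptor()` are hypothetical names, and `DescriptorSketch` only stands in for `HostTensorDescriptor` so the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for HostTensorDescriptor: it only stores the
// lengths and strides it was constructed with.
struct DescriptorSketch
{
    std::vector<std::size_t> lengths;
    std::vector<std::size_t> strides;
};

// One shared helper that covers the 1D, 2D, ... cases, so each source
// file no longer needs its own lambda per rank.
inline DescriptorSketch make_descriptor(std::vector<std::size_t> lengths,
                                        std::vector<std::size_t> strides)
{
    assert(lengths.size() == strides.size());
    return DescriptorSketch{std::move(lengths), std::move(strides)};
}

// Small self-check: builds a 1D and a 2D descriptor with the same helper.
inline bool check_make_descriptor()
{
    const DescriptorSketch d1 = make_descriptor({8}, {1});
    const DescriptorSketch d2 = make_descriptor({2, 3}, {3, 1});
    return d1.lengths.size() == 1 && d2.lengths.size() == 2 && d2.strides[0] == 3;
}
```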

Currently we put headers into _include/**ck/xxxxx**_ sub-directories, except for _ckProfiler_. ```console $ tree library/include/ -L 3 library/include/ └── ck └── library ├── reference_tensor_operation ├── tensor_operation_instance └── utility $ tree profiler/include/ -L...

For targets like _ckProfiler_, I found that the existing source files and the `add_executable()` arguments are identical. We can see the same symptom in the instance libraries: - First argument of...

Add a new `fmha_fwd_appendkv()` API, which runs ahead of the `fmha_fwd()`/`fmha_fwd_splitkv()` API. The `fmha_fwd_appendkv()` + `fmha_fwd()`/`fmha_fwd_splitkv()` combination implements the functionality of `mha_fwd_kvcache()` in FA 2.5 (without the paged-kvcache part).