hydrated(tutor:张银奎)ic*b*rg r*bb*sh
I successfully compiled a version of apex on A100 hardware, but the fmha test took 21 seconds to finish. What could be the cause?
In many CUDA-related projects, the following line of code appears: ``` extern "C" __device__ uint32_t __nvvm_get_smem_pointer(void *ptr); ``` It is used to convert the shared...
Using [this header](https://github.com/adityaatluri/gemm-vega64/blob/master/shared_ops.h) as a reference, I wrote the following inline assembly code: ``` inline __device__ void sts(uint32_t ptr, uint4 val) { asm volatile("DS_WRITE_B128 %0, %1;\n" : : "v"(ptr) , "v"(val)); } ``` But...
I followed [this issue](https://github.com/RadeonOpenCompute/hcc/issues/693) to write the following code. The hip file: ``` #include "hip/hip_runtime.h" #include "hip/hcc_detail/device_library_decls.h" __global__ void halfVec_v_pk_sts_then_lds( uint16_t * dst , uint32_t * ptr , uint16_t * val...
Following the instructions in the **Evaluate LMTraj-SUP** section [here](https://github.com/inhwanbae/lmtrajectory), I ran the following commands: ` ./script/eval_all.sh ./script/eval_all_deterministic.sh ` but did not get the w/ image results. Is anything missing?
The README of this project provides links to both the TensorFlow and PyTorch versions of the pretrained model weights. But after inspecting both of them, it seems...
I followed the instructions in Readme.md to configure the environment, but did not get the results reported in the paper by just running: `python gtsrb_visualize_example.py` The produced results as gtsrb_visualize_(mask/pattern/fusion)_label_x.png...