kernel3
Hello, I'll ask my question in Chinese. In kernel3, you changed the block size from (32,32) to (1024), and you said this has three benefits: 1. storing threadIdx.x before re-using it massively; 2. reducing live registers; 3. benefiting compiler optimization. I don't understand what any of these mean, and I can't find an explanation in the book or online. Could you explain in more detail? A pointer to reference material would be even better!
Can I add an option in a callback function? For example:

```cpp
CLI::App app{"argparse"};
parameter = app.add_subcommand("parameter", "")->parse_complete_callback([&]() {
    std::string path;
    parameter->add_option("--path", path, "the absolute path of files");
});
CLI11_PARSE(app, argc, argv);
```
...
Assume there are 4x4 elements in shared memory. I can use composition(Swizzle{},...) to swizzle each element successfully, but now I want to swizzle in units of 2x2 elements,...
According to https://pytorch.org/blog/flash-decoding/, flash decoding has two stages; the second stage is "reduce && rescale contribution of each split", but I can't find the reduce kernel after the kernel `compute_attn_1rowblock_splitkv`,...
Suppose R=2, S=2, C=128, ThreadBlockShape=. In the main loop over the implicit GEMM K dimension, the memory access sequence in CUTLASS will be r=0,s=0,c=0-63; r=0,s=1,c=0-63; r=1,s=0,c=0-63; r=1,s=1,c=0-63; then r=0,s=0,c=64-127; r=0,s=1,c=64-127; r=1,s=0,c=64-127; r=1,s=1,c=64-127. But why not...
I encountered a strange phenomenon with the code below:

```cpp
auto l = make_identity_tensor(make_shape(_32{}, 64));
using T_V = cutlass::AlignedArray;
using v_g2r_copy_op = UniversalCopy;
using v_g2r_traits = Copy_Traits;
using v_g2r_copy_atom = Copy_Atom;
```
...
When I rewrote the CUTLASS conv with the cute API, I encountered an interesting phenomenon: during the global-to-shared-memory stage, I couldn't reduce the bank conflicts using a cute swizzle. After some research, I found...
**What is your question?** Does CUTLASS support DRIVE Thor yet? If not, when will it be supported?