kernel3
Hello, I'll ask my question in Chinese. In kernel3, you changed the block size from (32,32) to (1024), and you said this has three benefits: 1. storing threadIdx.x before re-using it massively; 2. reducing live registers; 3. benefiting compiler optimization. I don't understand what any of these mean, and I can't find an explanation in the book or online. Could you explain in more detail? A pointer to reference material would be even better!
Can I add an option in a callback function? For example:

```cpp
CLI::App app{"argparse"};
parameter = app.add_subcommand("parameter", "")->parse_complete_callback([&]() {
    std::string path;
    parameter->add_option("--path", path, "the absolute path of files");
});
CLI11_PARSE(app, argc, argv);
```
...
Assume there are 4x4 elements in shared memory. I can use composition(Swizzle{},...) to swizzle each element successfully, but now I want to swizzle in units of 2x2 elements,...
According to https://pytorch.org/blog/flash-decoding/, flash decoding has two stages; the second stage is "reduce && rescale contribution of each split", but I can't find the reduce kernel after the kernel `compute_attn_1rowblock_splitkv`,...
Suppose R=2, S=2, C=128, ThreadBlockShape=. In the main loop over the implicit GEMM K dimension, the memory access sequence in CUTLASS will be r=0,s=0,c=0-63; r=0,s=1,c=0-63; r=1,s=0,c=0-63; r=1,s=1,c=0-63; then r=0,s=0,c=64-127; r=0,s=1,c=64-127; r=1,s=0,c=64-127; r=1,s=1,c=64-127. But why not...
I encountered a strange phenomenon with the code below:

```cpp
auto l = make_identity_tensor(make_shape(_32{}, 64));
using T_V = cutlass::AlignedArray;
using v_g2r_copy_op = UniversalCopy;
using v_g2r_traits = Copy_Traits;
using v_g2r_copy_atom = Copy_Atom;
```
...
When I rewrote the CUTLASS conv with the cute API, I encountered an interesting phenomenon: during the global-to-shared-memory stage, I couldn't reduce the bank conflicts using a cute swizzle. After some research, I found...
**What is your question?** Does CUTLASS support DRIVE Thor yet? If not, when will it be supported?