Xiangchendong

Results 7 issues of Xiangchendong

Why is loading from a local directory not supported, instead of always downloading? May I open a pull request for this?

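### Is your feature request related to a problem? Please describe. Thank you very much for your work! I have a small question: I noticed that the README has a table of throughput and GPU memory usage. BMTrain significantly outperforms DeepSpeed-Megatron, and I am curious where this optimization mainly comes from. With the same logic, why can we support a larger batch size and achieve higher throughput? Could the GPU configuration also be a factor (would SXM machines erase this gap because of their higher bandwidth)? I think optimization at this level is definitely work worthy of a top systems conference. Would you be interested in analyzing the optimization points and writing them up as a paper for submission? I would really like to use the BMTrain framework, but seeing only that it is good without knowing why leaves me uneasy. ### Describe the solution you'd like Same as above ### Describe alternatives you've considered _No response_...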

enhancement

Hello! Thank you very much for this, it seems really good! I am wondering how the style adapter was trained. For example, if I have van Gogh images, then the input of...

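Thank you for your excellent work! I have a small question: ![](https://github.com/NUS-HPC-AI-Lab/OpenDiT/blob/master/figure/end2end.png) Which mechanism mainly accounts for the performance difference in this figure (what attention implementation does DiT use in the figure, is kernel fusion applied, etc.)? I noticed that even a single GPU is twice as fast, so the speedup probably does not come mainly from sequence parallelism. Are there ablation experiments showing which mechanism the main performance gain comes from?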

**Is your feature request related to a problem? Please describe.** My truly minimal use case request: I have 2 datasets with resolutions 256 and 512, and I want to build 2...

enhancement

After installing tk, I tested with:

```python
import torch
from thunderkittens import mha_forward
# import IPython; IPython.embed()
device = torch.device('cuda:0')
x = torch.randn(2, 8, 1024, 64, device=device, dtype=torch.bfloat16)
causal =...
```

The attention interface in the test code differs from the C++ kernel interface.