TheTinyTeddy

7 issues opened by TheTinyTeddy

Hi, thank you for the great work! I was wondering why the precision used for CogVideoX is FP16, whereas other T2V models such as Open-Sora and Open-Sora-Plan use BF16. Also,...
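A rough sketch of what trying BF16 would look like, using the diffusers CogVideoXPipeline (the checkpoint ID, prompt, and step counts below are placeholders, and whether the released weights behave well in BF16 is exactly what is being asked):

```python
import torch
from diffusers import CogVideoXPipeline

# Load the pipeline in BF16 instead of FP16; checkpoint and prompt are placeholders.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",           # placeholder checkpoint
    torch_dtype=torch.bfloat16,     # vs. torch.float16 as in the examples
).to("cuda")

video = pipe(
    prompt="a panda playing guitar by a lake",  # placeholder prompt
    num_inference_steps=50,
    num_frames=49,
).frames[0]
```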

Hi, thank you for the great work! I was wondering whether DeepSpeed-Ulysses is the sequence parallelism method used in both inference and training of Open-Sora-Plan v1.2.0. (As a side note, I think you...
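To make sure the question is about the same thing, here is a single-process illustration of the Ulysses-style layout change (the shapes and parallel degree are made up, and the all-to-all is emulated with plain tensor slicing rather than an actual process group):

```python
import torch

# Single-process illustration of the DeepSpeed-Ulysses idea: before attention,
# an all-to-all trades the sequence split for a head split, so each rank sees
# the full sequence but only H/P heads. Here the "ranks" are just a list.
B, S, H, D, P = 2, 16, 8, 64, 4            # batch, seq, heads, head_dim, SP degree

# Each rank holds a sequence shard: [B, S/P, H, D]
shards = list(torch.randn(B, S, H, D).chunk(P, dim=1))

# Emulate the all-to-all: every rank keeps H/P heads of every sequence shard.
after_a2a = []
for r in range(P):                          # receiving "rank" r
    heads = [s[:, :, r * (H // P):(r + 1) * (H // P), :] for s in shards]
    after_a2a.append(torch.cat(heads, dim=1))   # -> [B, S, H/P, D]

print(after_a2a[0].shape)  # torch.Size([2, 16, 2, 64]): full sequence, H/P heads
```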

Thank you for this amazing work! I was wondering if the FP8 implementation of FlashAttention-3 will be available for public use? My main concern is accuracy...
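As a rough way to reason about the accuracy concern (this is not the FA3 FP8 kernel, just an input-quantization probe with arbitrary shapes): cast Q/K/V to FP8 E4M3 and back with a per-tensor scale, run standard SDPA in BF16, and compare against the unquantized result.

```python
import torch
import torch.nn.functional as F

# Rough accuracy probe: quantize Q/K/V to FP8 E4M3 and back, run standard SDPA
# in BF16, and compare against the unquantized result. This is NOT the FA3 FP8
# kernel, just a sanity check on the error introduced by FP8 inputs.
torch.manual_seed(0)
q, k, v = (torch.randn(1, 8, 1024, 64, dtype=torch.bfloat16, device="cuda")
           for _ in range(3))

def fake_fp8(x: torch.Tensor) -> torch.Tensor:
    # Per-tensor scaling into the E4M3 range, cast to FP8 and back.
    scale = x.abs().amax() / 448.0
    return (x / scale).to(torch.float8_e4m3fn).to(torch.bfloat16) * scale

ref = F.scaled_dot_product_attention(q, k, v)
approx = F.scaled_dot_product_attention(fake_fp8(q), fake_fp8(k), fake_fp8(v))
print((ref - approx).abs().max().item())   # worst-case element error
```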

From the description of Q4_K: “4-bit quantization (q). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit), resulting in 4.5 bits-per-weight.”...
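For context, the 4.5 bits-per-weight figure checks out once the super-block overhead is counted; a back-of-the-envelope tally, assuming the usual llama.cpp layout of one FP16 super-block scale and one FP16 super-block min on top of the quoted 6-bit per-block scales/mins:

```python
# Back-of-the-envelope check of the 4.5 bpw figure for Q4_K, assuming the
# usual llama.cpp super-block layout: 8 blocks x 32 weights = 256 weights,
# 4-bit quants, 6-bit per-block scales and mins, plus one FP16 super-block
# scale (d) and one FP16 super-block min (dmin) that the 6-bit values multiply.
weights_per_superblock = 8 * 32            # 256
quant_bits  = weights_per_superblock * 4   # 1024 bits of 4-bit q's
scale_bits  = 8 * 6                        # per-block 6-bit scales
min_bits    = 8 * 6                        # per-block 6-bit mins
super_bits  = 2 * 16                       # fp16 d and dmin (assumed layout)

total_bits = quant_bits + scale_bits + min_bits + super_bits
print(total_bits / weights_per_superblock)  # 4.5 bits per weight

# Dequantization follows the quoted formula, w = q * block_scale + block_min,
# with the 6-bit values themselves scaled by the FP16 d / dmin of the super-block.
```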

I have tried NVFP4 training, which converges on SM120, but the fp8blockscaled recipe won't converge for any of its available options. Is it because of the power-of-2 scale (cannot be...
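To make the power-of-2 concern concrete, a small illustration of what restricting a per-block scale to 2**k (E8M0-style) means versus a free amax-derived scale; this is one reading of the issue, not the recipe's actual code, and the block size and rounding direction are assumptions:

```python
import math
import torch

# Illustration of the power-of-2 scale restriction: a 2**k scale can be up to
# ~2x away from the ideal amax-based scale, which shifts where values saturate
# or underflow relative to a free-form FP scale.
torch.manual_seed(0)
block = torch.randn(128) * 3.7
fp8_max = 448.0                             # E4M3 max magnitude

ideal_scale = block.abs().max().item() / fp8_max
pow2_scale  = 2.0 ** math.ceil(math.log2(ideal_scale))   # round up to 2**k

print(f"ideal scale : {ideal_scale:.6f}")
print(f"power-of-2  : {pow2_scale:.6f}")
print(f"scale ratio : {pow2_scale / ideal_scale:.3f}  (1.0 means no lost headroom)")
```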

For SM120, with disable_rht=True set in transformer_engine.common.recipe.NVFP4BlockScaling the code works fine, but when disable_rht=False is set, the code below results in a CUDA error: with te.fp8_autocast(enabled=True, fp8_recipe=fp4_recipe): out_fp4 =...
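The truncated snippet corresponds roughly to a minimal repro of this shape; the te.Linear layer, sizes, and params_dtype below are stand-ins rather than the original code, while the recipe and autocast names are the ones from the report:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import NVFP4BlockScaling

# Minimal sketch of the failing pattern: toggling disable_rht on the NVFP4
# recipe and running a TE layer under fp8_autocast. Layer choice and shapes
# are placeholders, not the original reproducer; disable_rht is passed here
# as a constructor kwarg (adjust if it is set differently).
fp4_recipe = NVFP4BlockScaling(disable_rht=False)   # True reportedly works on SM120

layer = te.Linear(1024, 1024, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(128, 1024, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp4_recipe):
    out_fp4 = layer(x)        # CUDA error reported here when disable_rht=False
out_fp4.sum().backward()
```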

Many thanks for the great work! In the paper https://arxiv.org/pdf/2502.20853 they successfully use 1D weight quantization with requantization, and in their repo https://github.com/thu-ml/TetraJet-MXFP4Training/issues/2#issuecomment-3454394125 the author mentioned from their...
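One reading of "1D weight quantization" here is group-wise quantization along a single axis of the weight (e.g. 1x32 groups sharing a power-of-2 scale, MXFP4-style) rather than 2D tiles; a toy sketch of that grouping only, not the paper's actual requantization procedure:

```python
import torch

# Toy sketch of 1D (per-row, 32-element group) quantization with a shared
# power-of-2 scale per group, in the spirit of MXFP4; this illustrates the
# 1D grouping only, not the TetraJet requantization scheme.
def quantize_1d_groups(w: torch.Tensor, group: int = 32, max_q: float = 6.0):
    rows, cols = w.shape
    g = w.reshape(rows, cols // group, group)                  # 1D groups along dim=1
    amax = g.abs().amax(-1, keepdim=True).clamp_min(1e-12)
    scale = 2.0 ** torch.ceil(torch.log2(amax / max_q))        # power-of-2 shared scale
    q = torch.clamp(torch.round(g / scale), -max_q, max_q)     # coarse 4-bit-ish grid
    return (q * scale).reshape(rows, cols)                     # dequantized view

w = torch.randn(64, 256)
w_hat = quantize_1d_groups(w)
print((w - w_hat).abs().mean().item())   # mean quantization error
```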