cyz0202 issues

Results 6 issues of cyz0202

Pretraining cannot use the GPU

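Hello, I am pretraining the tiny model on my own data, and it appears to run on the CPU only. My environment is set up as follows: 1. device_list shows 1 CPU and 4 GPUs. 2. TF version: I tried both having only the GPU build installed and having the GPU and CPU builds installed side by side (with GPU version >= CPU version). 3. CUDA_VISIBLE_DEVICES is set to the indices of existing GPUs. The result is as follows: heavy CPU usage, while the GPUs use only a little over 100 MB: /Users/zhangyang/Documents/Albert cpu使用情况.png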
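One thing that can be ruled out without any framework is the CUDA_VISIBLE_DEVICES value itself. The sketch below is only an illustration: `effective_devices` is a hypothetical helper, it handles integer indices only (not GPU UUIDs), and the GPU count of 4 is taken from the report above. It mirrors the documented CUDA behavior that devices listed after the first invalid index are not visible.

```python
import os


def effective_devices(value, gpu_count):
    """Return the GPU indices that remain visible for a CUDA_VISIBLE_DEVICES value.

    Hypothetical helper; integer indices only, GPU UUIDs are not handled.
    """
    if value is None:
        # Variable unset: every GPU is visible.
        return list(range(gpu_count))
    visible = []
    for token in value.split(","):
        try:
            index = int(token.strip())
        except ValueError:
            # Empty or non-integer entry: treated as invalid in this sketch.
            break
        if not 0 <= index < gpu_count:
            # Entries after the first invalid index are ignored as well.
            break
        visible.append(index)
    return visible


if __name__ == "__main__":
    current = os.environ.get("CUDA_VISIBLE_DEVICES")
    print("CUDA_VISIBLE_DEVICES =", repr(current))
    print("visible GPU indices (assuming 4 GPUs):", effective_devices(current, 4))
```

If this prints an empty list, the process cannot see any GPU regardless of which TensorFlow build is installed; if it prints the expected indices, the cause lies elsewhere.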

Question about CuPy/CUDA

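1. Hello. As described in the figure, I want to see the exact definitions and usage of the functions CuPy uses to drive CUDA, but I cannot find them, possibly because CuPy wraps C/C++ code. Where can I look them up? Could you also explain the execution order of the 4 functions in the third parameter, routine, shown in the figure? (My rough understanding is that they create a struct and compute the scale for symmetric quantization.) Jumping to the definition only shows a doc like this. ------------------------ 2. Why is the code inside the red box in the figure below written that way? ------------------------- **3. Why did you choose to operate CUDA directly through CuPy, for example for the allocator, igemm, and fgemm? Does this have a bigger advantage over implementing quantization with a framework (such as PyTorch)? The CuPy + CUDA approach seems quite demanding.** **Thank you very much** @a710128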
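For readers unfamiliar with the term, the "scale for symmetric quantization" mentioned in question 1 can be illustrated in plain Python. This is a generic textbook sketch, not the routine from the figure or the library's CUDA code; the function names `symmetric_scale`, `quantize`, and `dequantize` are made up for illustration, and signed 8-bit integers are assumed.

```python
def symmetric_scale(values, num_bits=8):
    """Scale that maps the largest magnitude onto the largest signed integer."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in values)
    # Avoid division by zero for an all-zero input.
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round each value to the nearest integer step and clamp to [-qmax, qmax]."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [-2.54, 0.0, 1.0, 2.54]
    scale = symmetric_scale(weights)
    q = quantize(weights, scale)
    print("scale:", scale)
    print("quantized:", q)
    print("dequantized:", dequantize(q, scale))
```

The mapping is symmetric because zero maps to zero and the same scale is used for positive and negative values, so no zero-point offset is needed.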

where is the LSH attention?

Hi, I can't find the implementation of the LSH attention. Is it in progress? Thanks.

inference time

Hi, I am puzzled by the inference time of the compressed model. Why does the compressed model take more time? Shouldn't it be faster with fewer parameters (about half...

Reason for the effective GPU memory savings

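Hello, and thank you for developing such a powerful tool as xtuner! I have a question about GPU memory consumption. Question: when fine-tuning qwen2-7b with LoRA, I measured GPU memory usage, looking mainly at the model(**input) step, i.e. the memory used by the forward pass, and found it was half of what the theoretical analysis predicts. Specifically, in theory all activations of one forward pass in fp16 should take close to 30 GB of GPU memory, but xtuner uses only half of that. Judging from the parameters, gradient checkpointing (activation checkpointing) is not enabled, so how is the memory saving achieved? Looking forward to your answer, thank you very much!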

[Bug]: last gpu OOM when use pipeline parallelism with 2 nodes x 8cards each node

### Your current environment The output of `python collect_env.py` ```text INFO 03-02 22:22:26 __init__.py:190] Automatically detected platform cuda. Collecting environment information... PyTorch version: 2.5.1+cu124 Is debug build: False CUDA used...

bug