**Describe the issue**: I want to run the DARTS example on multiple GPUs, so I wrapped the model with DDP and split the data with DistributedSampler. However, I found the final 2...
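For reference, the usual wiring for this setup looks like the sketch below; `train`, `model`, and `dataset` are placeholders rather than the DARTS example's actual objects, and the script is assumed to be launched with `torchrun`.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset):
    # One process per GPU; rank and world size come from the torchrun environment.
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = DDP(model.cuda(rank), device_ids=[rank])
    sampler = DistributedSampler(dataset)   # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(3):
        sampler.set_epoch(epoch)             # reshuffle the shards every epoch
        for x, y in loader:
            opt.zero_grad()
            loss = torch.nn.functional.cross_entropy(
                model(x.cuda(rank)), y.cuda(rank))
            loss.backward()                  # DDP all-reduces gradients here
            opt.step()
```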
**Describe the issue**: When I tried to use NAS, it reported 'ImportError: Cannot use a path to identify something from __main__.' and 'ValueError: Pickle too large when trying to dump...
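This error usually means the model class is defined in the script being executed directly, so the NAS framework cannot serialize it by reference from `__main__` and falls back to pickling the whole object. A minimal sketch of the common workaround, assuming that is the cause here (file names are illustrative): define the model in its own importable module.

```python
# model_def.py (illustrative file name): the model lives in an importable module.
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(32, 2)

    def forward(self, x):
        return self.fc(x)
```

```python
# main.py: the entry script only imports the model, so nothing model-related
# is defined under __main__.
from model_def import Net

model = Net()
```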
In hetero-LR settings, an arbiter role is needed. I would like to know which party I should assign this role to. In the tutorial, some choose the guest to be the...
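For orientation, the role section of a FATE job configuration looks roughly like the fragment below (written here as a Python dict; the party IDs are made up). In many published examples the arbiter is collocated with the host, but that is a deployment choice rather than a hard requirement.

```python
# Illustrative FATE job-conf "role" fragment; party IDs are invented.
role = {
    "guest":   [9999],    # party that initiates the job and holds the labels
    "host":    [10000],   # data-providing party
    "arbiter": [10000],   # coordinator for hetero-LR; often collocated with the host
}
```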
**Describe the issue**: I used 2 GPUs to train DARTS, but from the output I see that I get 2 different results. I also used 'export_onnx', but I didn't get...
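One plausible explanation: every DDP rank logs its own metrics, so two GPUs naturally produce two (slightly different) result lines, and any export step has to be guarded so that only one rank writes the file. A rough sketch, where `model` stands for the trained DDP-wrapped network and the input shape is invented:

```python
import torch
import torch.distributed as dist

torch.manual_seed(0)  # same seed on every rank so evaluations are comparable

if dist.get_rank() == 0:
    # Only rank 0 writes the file; concurrent ranks would race on it.
    dummy = torch.randn(1, 3, 32, 32).cuda()               # illustrative input shape
    torch.onnx.export(model.module, dummy, "darts.onnx")   # unwrap DDP before export
dist.barrier()  # keep the other ranks alive until the export finishes
```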
**Problem Description**: Installing the pycocotools dependency fails in a pure intranet (offline) environment. **Environment Information**: OS: Red Hat Enterprise Linux 7.7; Python: 3.10.9.
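The usual workaround for an air-gapped host is to download the wheels on an internet-connected machine with the same Python version and OS family, copy them over, and install with the package index disabled. A sketch of that workflow (the `./wheels` path is illustrative); note that if no prebuilt wheel matches the platform, pycocotools compiles from source and needs gcc plus the Python headers on the target machine.

```python
import subprocess, sys

# Step 1 -- on a machine WITH internet access (same Python 3.10, same OS family):
subprocess.run([sys.executable, "-m", "pip", "download",
                "pycocotools", "-d", "./wheels"], check=True)

# Step 2 -- after copying ./wheels onto the intranet host:
subprocess.run([sys.executable, "-m", "pip", "install", "--no-index",
                "--find-links", "./wheels", "pycocotools"], check=True)
```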
When I tried CodeLlama-7b and CodeLlama-34b to test code completion, all results were garbled. Environment: OS: Red Hat 4.8.5-36; GCC: 4.8.5; 32G V100; CUDA: 11.7; torch: 2.0.0; fairscale: 0.4.13; sentencepiece: 0.1.99...
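A frequent cause of garbled generations on V100-class GPUs is running the weights in bfloat16, which Volta does not support natively; forcing float16 is worth trying. A sketch using the Hugging Face port of CodeLlama (this assumes transformers is an option; the fairscale-based reference repo would need the equivalent dtype change):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "codellama/CodeLlama-7b-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.float16,  # V100 has no native bf16; fp16 avoids junk logits
    device_map="auto",
)

prompt = "def fibonacci(n):"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```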
I used tensor_parallel to fine-tune a Qwen model with LoRA in a tensor-parallel way. However, it cannot save the model at the end. Could you provide any help? Thanks.
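If this is the `tensor_parallel` package from PyPI, its documented way to save is to gather the shards back into an ordinary state dict with the `save_tensor_parallel` context manager before calling `torch.save`. A sketch under that assumption (model name and output path are illustrative; if the installed version lacks this helper, the approach does not apply):

```python
import torch
import tensor_parallel as tp
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])

# ... LoRA fine-tuning happens here ...

# Gather sharded weights into a regular (non-parallel) state dict before saving;
# saving the wrapped module directly yields per-GPU shards that are hard to reload.
with tp.save_tensor_parallel(model):
    torch.save(model.state_dict(), "qwen_lora.pt")
```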
I used the following script for training; the dataset has about 33,000 samples, with per_device_batch_size=16, gradient_accumulation_steps=32, epochs=3, on 4 GPUs. nproc_per_node=4 NPROC_PER_NODE=$nproc_per_node \ CUDA_VISIBLE_DEVICES=0,1,2,3 \ swift pt \ --model Qwen/Qwen2.5-7B \ --train_type full \ --dataset $CUSTOM_DATASET \ --torch_dtype bfloat16 \ --num_train_epochs 3 \ --per_device_train_batch_size 16 \ --per_device_eval_batch_size 1 \...
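As a sanity check on this schedule (assuming ms-swift follows the usual transformers convention that the effective batch is per-device batch × gradient accumulation × number of GPUs): 16 × 32 × 4 = 2048 samples per optimizer step, so one epoch over ~33,000 samples is only about 16 steps, roughly 48 in total.

```python
# Quick arithmetic for the training run described above.
per_device_batch = 16
grad_accum = 32
world_size = 4            # 4 GPUs
dataset_size = 33_000
epochs = 3

effective_batch = per_device_batch * grad_accum * world_size   # 2048
steps_per_epoch = dataset_size // effective_batch              # 16
total_steps = steps_per_epoch * epochs                         # 48
print(effective_batch, steps_per_epoch, total_steps)
```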