terrificdm

Results: 5 issues by terrificdm

For single-GPU training, every time I run the script I have to run `export SLURM_LOCALID=0`, `export SLURM_PROCID=0`, and `export SLURM_NNODES=1` before training starts successfully. My question is for...
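One possible way to avoid retyping those exports is to set them from a small Python wrapper before training starts. This is only a sketch: the three variable names and values are the ones quoted in the issue, while the helper name `apply_single_gpu_slurm_defaults` is hypothetical and not part of the training script in question.

```python
import os

# Values taken from the exports quoted in the issue above.
SINGLE_GPU_SLURM_DEFAULTS = {
    "SLURM_LOCALID": "0",
    "SLURM_PROCID": "0",
    "SLURM_NNODES": "1",
}


def apply_single_gpu_slurm_defaults(env=None):
    """Hypothetical helper: fill in the SLURM variables only where unset."""
    env = os.environ if env is None else env
    for name, value in SINGLE_GPU_SLURM_DEFAULTS.items():
        # setdefault leaves values provided by a real SLURM job untouched.
        env.setdefault(name, value)
    return env


if __name__ == "__main__":
    apply_single_gpu_slurm_defaults()
    # Start the training entry point from here, after the variables are set.
```

Because `setdefault` is used, the same wrapper should stay harmless under a real SLURM allocation, where the scheduler has already set these variables.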

By changing `generation_config.json`, or is there some other, more flexible method? Thanks.

## :rocket: Feature Request

### General Information

* [X] :wave: I may be able to implement this feature request
* [ ] :warning: This feature might incur a breaking change...

feature-request
effort/medium
p2

I use one NVIDIA L40S (48 GB VRAM) to train a LoRA for Flux, and here is my training script: `./sd-scripts/flux_train_network.py --pretrained_model_name_or_path ./model/flux1-dev.safetensors --clip_l ./model/clip_l.safetensors --t5xxl ./model/t5xxl_fp16.safetensors --ae ./model/ae.safetensors --cache_latents_to_disk --save_model_as safetensors...

I'm curious about the differences in networking requirements between EP and TP, particularly regarding latency, bandwidth, and throughput. I know that both distribution strategies are highly demanding in terms of...