Shaomu
> Thanks for your interest in my work!
>
> As a sanity check step, can you try training `bert-base-multilingual-uncased` with grad cache _disabled_ and compare memory usage against some...
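For the memory comparison, here is a minimal sketch of how peak GPU memory could be recorded around a training step, assuming a plain PyTorch setup (the commented-out training calls are placeholders, not the repo's actual API):

```python
import torch

def report_peak_memory(tag: str) -> None:
    # Peak memory allocated by tensors since the last reset, in GB.
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"[{tag}] peak GPU memory: {peak_gb:.2f} GB")

torch.cuda.reset_peak_memory_stats()
# ... run one forward/backward pass with grad cache disabled here ...
report_peak_memory("grad cache disabled")

torch.cuda.reset_peak_memory_stats()
# ... run the same pass with grad cache enabled here ...
report_peak_memory("grad cache enabled")
```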
Another question: does `train_dense_retriever` support multi-GPU training as well? Since mBERT requires more memory, I think using multiple GPUs might help. I tried to use `python -m torch.distributed.launch --nproc_per_node=4 train_dense_retriever.py`...
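For reference, a minimal sketch of what a training script needs for `torch.distributed.launch` to drive it; this is the generic PyTorch DDP pattern, not the actual `train_dense_retriever.py` code, so whether the repo already does this is exactly my question:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    # torch.distributed.launch spawns one process per GPU; newer versions export
    # LOCAL_RANK, while older ones pass --local_rank as a CLI argument instead.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    # Wrap the model so gradients are averaged across the 4 processes.
    return DDP(model.cuda(local_rank), device_ids=[local_rank])
```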
> Could you try sharing the full log?
>
> Meanwhile here is a checklist for things to test out:
>
> * Make sure you have a valid display...
> Have you checked out the `get_batch_scores` method yet? It sounds like this might be what you're looking for.

I think `get_batch_scores` computes the BM25 scores between one...
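If this refers to the `rank_bm25` package (an assumption on my part; the thread does not name the library), `get_batch_scores` scores one query against a chosen subset of corpus documents rather than the whole collection, roughly like this:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "hello there good man",
    "it is quite windy in london",
    "how is the weather today",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "windy london".split()
# Score the query against documents 1 and 2 only (indices into the corpus).
subset_scores = bm25.get_batch_scores(query, doc_ids=[1, 2])
print(subset_scores)
```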
> @Smu-Tan @puzzlecollector were you able to find an alternative to this implementation to speed up the process?

Check out Pyserini.
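A rough sketch of the Pyserini route, assuming a recent Pyserini version and using a prebuilt index name purely as an example:

```python
from pyserini.search.lucene import LuceneSearcher

# Lucene-backed BM25 search; much faster than a pure-Python BM25 loop.
searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
hits = searcher.search("what is the weather in london", k=10)
for hit in hits:
    print(hit.docid, hit.score)
```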
Three "solutions" work for my case:

1. use ZeRO-2 + bf16 instead of ZeRO-2 offload + bf16;
2. use fp16 rather than bf16 (this works with ZeRO-2 offload);
3. change the source...
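As a sketch of the first workaround, assuming a standard DeepSpeed config written here as a Python dict (e.g. passed to HF `TrainingArguments(deepspeed=...)`): keep ZeRO stage 2 and bf16, but drop the optimizer offload block.

```python
# ZeRO-2 + bf16, with the CPU optimizer offload removed (workaround 1).
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        # The failing variant additionally had:
        # "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```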
@waterluck Not sure if it helps, but you could check [this](https://huggingface.co/blog/zh/deepspeed-to-fsdp-and-back).
Same problem when using multiple nodes: the job gets stuck when initializing the critic model: `(WorkerDict pid=826494) Qwen2ForTokenClassification contains 13.99B parameters (WorkerDict pid=826494) Before critic FSDP, memory allocated (GB): 0.00, memory reserved (GB):...`