Question - Has Anyone Successfully Trained a Retrieval Model in a Distributed Setup?

Open BrianMiner opened this issue 1 year ago • 0 comments

I am curious whether anyone here has successfully trained a basic two-tower retrieval model in a distributed fashion, using Horovod or any other method. I am seeing consistently poor results when training distributed and better results with a single GPU (although not incredible). For the Horovod method, I am sharding the interaction data across 8 GPUs. Has anyone seen good performance when distributing this computation?
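
To make "sharding the interaction data across 8 GPUs" concrete, here is a minimal sketch of one way such a split could look. It assumes a simple strided split over (user_id, item_id) pairs. The helper name `shard_interactions` and the data schema are hypothetical and are not taken from the actual training code.

```python
from typing import List, Sequence, Tuple

# Hypothetical schema: one interaction is a (user_id, item_id) pair.
Interaction = Tuple[int, int]


def shard_interactions(
    interactions: Sequence[Interaction], rank: int, size: int
) -> List[Interaction]:
    """Return the strided slice of `interactions` owned by worker `rank` of `size`."""
    if size <= 0 or not 0 <= rank < size:
        raise ValueError(f"invalid rank/size: rank={rank}, size={size}")
    # Worker r takes rows r, r + size, r + 2 * size, ...
    return list(interactions[rank::size])


if __name__ == "__main__":
    # Toy data: 20 interactions split across 8 workers.
    data = [(user, user % 5) for user in range(20)]
    for rank in range(8):
        shard = shard_interactions(data, rank, 8)
        print(rank, len(shard), shard)
```

With Horovod, `rank` and `size` would typically come from `hvd.rank()` and `hvd.size()` after `hvd.init()`. Note that a strided split like this can give workers shards of slightly different lengths when the row count is not a multiple of the worker count.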

Mar 18 '24 20:03 BrianMiner