recommenders
Question - Anyone Trained Retrieval Distributed Successfully?
I am curious if anyone out here has successfully run a basic two-tower retrieval model trained in a distributed fashion using Horovod or any other method. I am seeing consistently poor results when training distributed and better results with a single GPU (although still not incredible). For the Horovod setup I am sharding the interaction data across 8 GPUs. Has anyone seen good performance distributing this computation?
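For concreteness, here is a minimal sketch of the kind of setup I mean, using TensorFlow/Keras with horovod.tensorflow.keras. The model, synthetic data, feature names, and hyperparameters are placeholders rather than my actual pipeline, and it assumes a Horovod/TF combination where `hvd.DistributedOptimizer` can wrap the Keras Adam optimizer:

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to one GPU (8 processes -> 8 GPUs).
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

NUM_USERS, NUM_ITEMS, EMBED_DIM = 100_000, 50_000, 64

def build_two_tower():
    # Toy two-tower model: user and item embedding towers scored by a dot product.
    user_in = tf.keras.Input(shape=(), dtype=tf.int32, name="user_id")
    item_in = tf.keras.Input(shape=(), dtype=tf.int32, name="item_id")
    user_vec = tf.keras.layers.Embedding(NUM_USERS, EMBED_DIM)(user_in)
    item_vec = tf.keras.layers.Embedding(NUM_ITEMS, EMBED_DIM)(item_in)
    score = tf.keras.layers.Dot(axes=1)([user_vec, item_vec])
    out = tf.keras.layers.Activation("sigmoid")(score)
    return tf.keras.Model([user_in, item_in], out)

# Synthetic interactions stand in for the real data; each worker keeps only
# its shard, which is what I mean by "sharding the interaction data".
n = 1_000_000
users = np.random.randint(0, NUM_USERS, n)
items = np.random.randint(0, NUM_ITEMS, n)
labels = np.random.randint(0, 2, n).astype("float32")
ds = (
    tf.data.Dataset.from_tensor_slices(((users, items), labels))
    .shard(hvd.size(), hvd.rank())
    .shuffle(100_000)
    .batch(4096)
)

model = build_two_tower()
# Common Horovod recipe: scale the learning rate by the number of workers
# and wrap the optimizer so gradients are all-reduced across GPUs.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(optimizer=opt, loss="binary_crossentropy")

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(ds, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

Launched with something like `horovodrun -np 8 python train.py`. Is this roughly the pattern others are using, or is there something about the sharding / learning-rate scaling that tends to hurt retrieval quality?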