Saaketh Narayan
@rishabhm12 @smilenaderi Are both of you using pre-processing functions for each sample/batch before training? Also, how big are your batch sizes and epoch sizes?
@miguelalba96 In the past, we've seen that treating a GCSFuse mount as "local" can be slow. Have you tried treating it as remote instead, or moving your data to local disk?
@miguelalba96 Some things you could try, given that local disk works well where the FUSE mount doesn't (rough sketch below):
* Increase the prefetch factor on the dataloader, and `predownload` on the dataset
* Set `remote` to be...
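A minimal sketch of that setup, assuming the data lives in a GCS bucket; the bucket path, cache directory, batch size, and worker counts below are illustrative placeholders:

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Treat GCS as the remote and point `local` at fast local disk,
# rather than at the GCSFuse mount. Paths and sizes are placeholders.
dataset = StreamingDataset(
    remote='gs://my-bucket/my-dataset',   # hypothetical bucket path
    local='/tmp/streaming_cache',         # local disk cache, not the FUSE mount
    batch_size=32,
    predownload=8 * 32,                   # predownload ahead of the dataloader
    shuffle=True,
)

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,
    prefetch_factor=4,                    # increase prefetch on the dataloader
    persistent_workers=True,
)
```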
I'm curious: what launchers are people using? I have reproduced the issue of low utilization between epochs when using TorchDistributor, but the issue goes away with the Composer launcher.
Hey @rishabhm12 @Matagi1996 @miguelalba96 @smilenaderi -- @XiaohanZhangCMU was able to root cause and fix the hangs between epochs. In internal testing, this has resolved inter-epoch hangs and has improved overall...
Hey, this would be great! What did you have in mind regarding the implementation -- what should be done on Streaming's side?
Hey @lhoestq, @orionw added support for storing MDS datasets on the Hugging Face Hub. The relevant section in the docs is [here](https://docs.mosaicml.com/projects/streaming/en/stable/how_to_guides/configure_cloud_storage_credentials.html#huggingface-datasets). Will ask internally about posting on socials! @orionw provided this simple...
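Not the exact snippet referenced above, but a rough sketch of what streaming an MDS dataset from the Hugging Face Hub can look like, assuming an `hf://datasets/...` remote path as covered in the linked docs; the repo path, token handling, and cache directory are illustrative:

```python
import os

from streaming import StreamingDataset

# Only needed for private/gated repos; see the linked credentials docs.
os.environ['HF_TOKEN'] = '<your-token>'

# Stream an MDS dataset hosted on the Hugging Face Hub.
# The repo path below is a placeholder.
dataset = StreamingDataset(
    remote='hf://datasets/my-org/my-mds-dataset',
    local='/tmp/hf_mds_cache',
    batch_size=32,
)

sample = dataset[0]
```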
@lhoestq we tweeted here: https://x.com/DbrxMosaicAI/status/1818407826852921833 thanks!
Hey, we have seen `index.json` load times be slow. I think this is because we download the index file on every single rank, rather than downloading it on just...
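For reference, a sketch of the kind of once-per-node download being described, assuming a standard torchrun-style `LOCAL_RANK` environment variable; `download_fn` is a hypothetical stand-in for Streaming's internal download routine, not the actual implementation:

```python
import os

import torch.distributed as dist


def fetch_index_once_per_node(download_fn, remote_index: str, local_index: str) -> None:
    """Download index.json on local rank 0 only, then let other ranks read it.

    `download_fn` is a hypothetical callable; in practice Streaming's own
    download helpers would be used here.
    """
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    if local_rank == 0 and not os.path.exists(local_index):
        download_fn(remote_index, local_index)
    if dist.is_available() and dist.is_initialized():
        dist.barrier()  # other ranks wait, then read the shared local copy
```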
Hey @smspillaz and @jarnoseppanen-sc, thanks for raising this issue with us! The PR above (#672) addresses this bug by ensuring that the shard size limit can never go above 2**32...
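For context, a rough sketch of the kind of guard the PR describes, assuming the bound comes from 32-bit addressing within a shard; the names here are illustrative, not the actual code from #672:

```python
# Illustrative guard: the configured shard size limit must never exceed 2**32 bytes.
MAX_SHARD_SIZE_LIMIT = 2 ** 32  # bytes


def validate_size_limit(size_limit: int) -> int:
    """Raise if the requested shard size limit exceeds the 2**32-byte bound."""
    if size_limit > MAX_SHARD_SIZE_LIMIT:
        raise ValueError(
            f'size_limit={size_limit} exceeds the maximum shard size limit of '
            f'{MAX_SHARD_SIZE_LIMIT} bytes (2**32).')
    return size_limit
```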