TZeng20
Hey @Lokiiiiii, is the aim of your feature to add a `distribution` argument for MultiWorkerMirroredStrategy in SageMaker, i.e. `estimator = sagemaker.tensorflow.TensorFlow(entry_point='script.py', ..........distribution = {})`? What would...
Does anything else apart from `TF_CONFIG` need to be configured to make MultiWorkerMirroredStrategy work? Let's say I use 2 instances of ml.g4dn.8xlarge, which has only 1 GPU per machine....
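For context, here is a minimal sketch of what a `TF_CONFIG` for the two-instance case could look like. `TF_CONFIG` is a JSON string with a `cluster` entry (the same on every worker) and a `task` entry (different on each worker). The host names `algo-1`/`algo-2`, the port `2222`, and the helper `build_tf_config` are assumptions for illustration only, not something SageMaker is confirmed to use here.

```python
import json
import os


def build_tf_config(worker_hosts, task_index):
    """Build the TF_CONFIG dict for one worker (hypothetical helper)."""
    return {
        # Identical on every worker: the full list of worker addresses.
        "cluster": {"worker": list(worker_hosts)},
        # Unique per worker: this process's role and position in the list.
        "task": {"type": "worker", "index": task_index},
    }


# Assumed addresses for the two instances; replace with the real ones.
hosts = ["algo-1:2222", "algo-2:2222"]

# On the first instance the index is 0; the second instance would use 1.
tf_config = build_tf_config(hosts, task_index=0)

# TF_CONFIG has to be set before the strategy object is created.
os.environ["TF_CONFIG"] = json.dumps(tf_config)
print(os.environ["TF_CONFIG"])
```

With one GPU per machine, each of the two workers would contribute a single replica, so the cluster list above has exactly two entries.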
I see, thanks. To load a Flan-T5-XXL model (11B params, ~45 GB) on 4 processes, would I need roughly 180 GB of RAM? Just to clarify, does `num_processes` have to...
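The arithmetic behind that estimate, as a rough sketch. It assumes fp32 weights (4 bytes per parameter) and that every process loads its own full copy of the model into CPU RAM; the function names are made up for illustration.

```python
def model_size_gb(num_params, bytes_per_param=4):
    """Approximate weight size in GB (1 GB = 1e9 bytes), fp32 by default."""
    return num_params * bytes_per_param / 1e9


def naive_host_ram_gb(per_copy_gb, num_processes):
    """Host RAM needed if each process holds a full copy (assumption)."""
    return per_copy_gb * num_processes


# 11B parameters at 4 bytes each is about 44 GB, close to the ~45 GB quoted.
fp32_gb = model_size_gb(11e9)

# Four processes, each holding a ~45 GB copy.
total_gb = naive_host_ram_gb(45, 4)

print(fp32_gb, total_gb)
```

This only counts the weights; anything else held in host memory during loading comes on top.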
In the code above, I am using FSDP but still getting the SIGKILL error. So is this still due to CPU RAM? Can I use DeepSpeed in addition to FSDP?
What is the correct method to use for larger datasets or embeddings with large dimensions? And would you use asynchronous `execute_write` for this?