TZeng20 comments


Support different tf.distribute.Strategies for distributed training on SageMaker

Hey @Lokiiiiii, is the aim of your feature to add a `distribution` argument for the multi-worker mirrored strategy in SageMaker? I.e. `estimator = sagemaker.tensorflow.TensorFlow(entry_point='script.py', ..........distribution = {})` What would...

Support different tf.distribute.Strategies for distributed training on SageMaker

Does anything apart from `TF_CONFIG` need to be configured to make `MultiWorkerMirroredStrategy` work? Let's say I use 2 instances of ml.g4dn.8xlarge, which have only 1 GPU per machine....
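
For reference, a minimal sketch of what a `TF_CONFIG` value for the two-instance case could look like. The hostnames `algo-1` and `algo-2`, the port `2222`, and the helper `build_tf_config` are assumptions made for illustration, not something confirmed in the thread. Only the JSON layout (a `cluster` dict plus a per-worker `task` entry) follows the format TensorFlow documents for multi-worker training.

```python
import json


def build_tf_config(worker_hosts, task_index, port=2222):
    """Build a TF_CONFIG JSON string for one worker (hypothetical helper).

    worker_hosts: hostnames of all workers, in the same order on every machine.
    task_index:   position of the current machine in worker_hosts.
    port:         assumed port; any free port shared by all workers would do.
    """
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index must point at an entry in worker_hosts")
    cluster = {"worker": [f"{host}:{port}" for host in worker_hosts]}
    task = {"type": "worker", "index": task_index}
    return json.dumps({"cluster": cluster, "task": task})


# Two single-GPU instances; hostnames are assumed, not taken from the thread.
hosts = ["algo-1", "algo-2"]
configs = [build_tf_config(hosts, i) for i in range(len(hosts))]
for cfg in configs:
    print(cfg)
```

Each worker would then export its own string as the `TF_CONFIG` environment variable before constructing the strategy. Whether anything beyond this is needed on SageMaker is the open question in the comment above.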

Failing to load model using accelerate launch

I see, thanks. To load a Flan-T5-XXL model (11B params, ~45 GB) on 4 processes, I would need roughly 180 GB of RAM? Just to clarify, does `num_processes` have to...
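
The arithmetic behind that estimate, as a small sketch. It assumes that every process loads its own full copy of the checkpoint into CPU RAM and that weights are stored in fp32 (4 bytes per parameter). Both are assumptions used for illustration, and the function names are hypothetical.

```python
def fp32_size_gb(num_params):
    """Approximate size of the weights in GB, assuming 4 bytes per parameter."""
    return num_params * 4 / 1e9


def peak_host_ram_gb(checkpoint_gb, num_processes):
    """Peak CPU RAM if each process holds a full copy of the checkpoint."""
    return checkpoint_gb * num_processes


weights_gb = fp32_size_gb(11_000_000_000)  # 11B parameters
total_gb = peak_host_ram_gb(45, 4)         # ~45 GB checkpoint, 4 processes
print(weights_gb, total_gb)
```

Under these assumptions the weights alone come to about 44 GB, which matches the ~45 GB figure, and four full copies add up to roughly 180 GB.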

Failing to load model using accelerate launch

In the code above, I am using FSDP but still getting the SIGKILL error. So, is this still due to CPU RAM? Can I use DeepSpeed in addition to FSDP?

upsert_vectors() shouldn't use element IDs?

What is the correct method to use for larger datasets or embeddings with large dimensions? And would you use asynchronous `execute_write` for this?