Omkar Pangarkar
Here we update the upper limit on the local batch size when we hit an OOM. The new upper limit is constrained to `LOCAL_BSZ_CUTOFF_PCT` of the current local batch size. We have to...
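The OOM handling described above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name `lower_local_bsz_limit`, its arguments, and the value `0.8` for `LOCAL_BSZ_CUTOFF_PCT` are all assumptions made for the example.

```python
# Assumed value for illustration only; the real constant may differ.
LOCAL_BSZ_CUTOFF_PCT = 0.8


def lower_local_bsz_limit(current_local_bsz, current_limit,
                          cutoff_pct=LOCAL_BSZ_CUTOFF_PCT):
    """Return a new upper limit on the local batch size after an OOM.

    Hypothetical helper: the new limit is cutoff_pct of the batch size
    that just ran out of memory, never below 1 and never above the
    limit already in place.
    """
    capped = max(1, int(current_local_bsz * cutoff_pct))
    return min(current_limit, capped)


if __name__ == "__main__":
    # An OOM at local batch size 100 with a previous limit of 128.
    print(lower_local_bsz_limit(100, 128, cutoff_pct=0.5))
```

Taking the minimum with the existing limit keeps the bound from ever growing in response to an OOM, which is one plausible reading of "constrained by" here.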
Running `export ADAPTDL_SUBMIT_REPO=registry.foo.com/dev/adaptdl-submit:latest` and then `adaptdl submit` results in a job failure with `Message: PodTemplate "resnet18-cifar10-v2-elastic-5b4xl" is invalid: template.spec.containers[0].image: Required value`. Workaround: it turns out that removing the `latest` tag from the env variable...
The current AdaptDL controller is a mediator between what the allocator wants and what the k8s default scheduler has or can do. It tries to reconcile the job states so that...
> str(bar) > 'data.frame': 506 obs. of 2 variables: > $ timestamp: POSIXct, format: "2014-08-25 00:00:00" "2014-08-25 00:10:00" ... > $ count : num 40465895 54157589 34727655 38576160 36686470 ......
This PR adds RaySGD API to Autodist which enables it to train models on a Ray cluster. The API defines a `TFTrainer` class which takes a model creator, data creator,...
**Please describe the bug** `example/linear_regression.py` with the AllReduce strategy crashes when run on a CPU-only multi-node cluster with a resource spec like: ``` nodes: - address: X.X.X.X cpus: [0] chief: true...