kiukchung

66 comments by kiukchung

Here's another example of a failed KFP integ test where it is hard to quickly tell what the root cause of the failure was: https://github.com/pytorch/torchx/runs/6595915396?check_suite_focus=true

> re: The job logs are created in per node files https://github.com/pytorch/torchx/pull/412 makes it so that when running with `dist.ddp` the node stdout and stderr log lines are prefixed with...

is this still an issue? can we fix as part of 0.1.2?

@d4l3k any issues on volcano side that you can link here to help us track it?

Hi there, adding a new scheduler to TorchX is quite straightforward. Here are the basic steps: 1. Subclass the [`torchx.schedulers.Scheduler`](https://pytorch.org/torchx/latest/schedulers.html#scheduler-classes) interface. There are a few methods you need to...

We could definitely do better. The `TORCHXCONFIG` env var is documented as part of the `torchx.runner.config.find_configs()` [method pydocs](https://pytorch.org/torchx/main/runner.config.html), but it won't be obvious to the user looking at the top...
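For readers who land here without having seen the pydocs, the lookup order can be sketched roughly as below. This is a simplified stand-in, not the actual `torchx.runner.config.find_configs()` implementation: the function name `locate_configs` is hypothetical, and the sketch assumes that a set `TORCHXCONFIG` takes precedence over any directory search and that the fallback looks for a `.torchxconfig` file in the given directories (defaulting to the current working directory).

```python
import os
from pathlib import Path
from typing import List, Optional

ENV_VAR = "TORCHXCONFIG"           # env var named in the comment above
CONFIG_FILENAME = ".torchxconfig"  # assumed default config file name


def locate_configs(dirs: Optional[List[str]] = None) -> List[str]:
    """Hypothetical sketch of an env-var-first config lookup.

    1. If TORCHXCONFIG is set, return only that path; dirs are not searched.
    2. Otherwise return every existing .torchxconfig found in dirs
       (or in the current working directory when dirs is empty).
    """
    explicit = os.environ.get(ENV_VAR)
    if explicit:
        return [explicit]

    search_dirs = dirs or [os.getcwd()]
    found = []
    for d in search_dirs:
        candidate = Path(d) / CONFIG_FILENAME
        if candidate.is_file():
            found.append(str(candidate))
    return found
```

With this shape, `TORCHXCONFIG=/path/to/custom.torchxconfig` would short-circuit the search entirely, which is the behavior a user skimming the top-level docs is unlikely to guess.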

FWIW this PR adds scheduler runoptions to each scheduler docs page by using SphinxDirective to query the runopts from the schedulers and generating the docs page. https://github.com/pytorch/torchx/pull/374

is this from torch-1.10 or torchelastic-0.1.0rc1? if the former, then can you move this issue to pytorch and tag it with the "elastic" tag and assign it to me for...

thanks for the PR! LGTM. just to clarify, this changes the torch dependency, not python, correct? if so, could you edit the title of the PR?