[Job Submission][refactor 5/N] Remove the head node dependency on the `Raylet` process

Open Catch-Bull opened this issue 3 years ago • 1 comments

Signed-off-by: Catch-Bull [email protected]

Why are these changes needed?

Move all interfaces that depend on the raylet process to JobAgent. After this PR, the head node no longer needs to start the raylet process.

Main changes:

  1. Remove JobManager: all interfaces that required JobManager are now implemented by calling JobAgentSubmissionClient.
  2. Delete init_ray_and_catch_exceptions from JobHead, so JobHead no longer calls ray.init.
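The delegation described in the two items above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `StubAgentClient` and `HeadSketch` are hypothetical names standing in for the agent-side client and the head-side handler, and the method names are assumptions. The point is that the head forwards every call and never initializes Ray itself.

```python
import asyncio
from typing import Dict


class StubAgentClient:
    """Hypothetical stand-in for an agent-side submission client."""

    def __init__(self) -> None:
        self._jobs: Dict[str, str] = {}

    async def submit_job(self, entrypoint: str) -> str:
        # Record the job and hand back an id; a real agent would start the job here.
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = "PENDING"
        return job_id

    async def get_job_status(self, job_id: str) -> str:
        return self._jobs[job_id]


class HeadSketch:
    """Hypothetical head-side handler: holds no job state and never calls ray.init."""

    def __init__(self, agent_client: StubAgentClient) -> None:
        self._agent = agent_client

    async def submit(self, entrypoint: str) -> str:
        # Forward to the agent instead of using a local job manager.
        return await self._agent.submit_job(entrypoint)

    async def status(self, job_id: str) -> str:
        return await self._agent.get_job_status(job_id)


async def demo() -> str:
    head = HeadSketch(StubAgentClient())
    job_id = await head.submit("echo hello")
    return await head.status(job_id)
```

Running `asyncio.run(demo())` submits one job through the head sketch and reads its status back from the stub agent.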

Related issue number

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

Catch-Bull avatar Sep 19 '22 11:09 Catch-Bull

@Catch-Bull could you give the PR a better name and description? It's hard to tell what this one does just from the name.

fishbone avatar Sep 19 '22 17:09 fishbone

Some CI job tests failed, e.g. https://buildkite.com/ray-project/oss-ci-build-pr/builds/431#01837a6f-fb4c-4e09-97e4-c9a57fdb10c9/3394-4825. Could you please take a look?

architkulkarni avatar Sep 27 '22 00:09 architkulkarni

Some CI job tests failed, e.g. buildkite.com/ray-project/oss-ci-build-pr/builds/431#01837a6f-fb4c-4e09-97e4-c9a57fdb10c9/3394-4825. Could you please take a look?

I tried to fix it, and all three tests now finish on my machine.

Catch-Bull avatar Sep 27 '22 16:09 Catch-Bull

@Catch-Bull Possibly relevant failure? https://buildkite.com/ray-project/oss-ci-build-pr/builds/566#01837fbf-2e34-449b-8125-1834439b9e51/3305-3805

architkulkarni avatar Sep 27 '22 18:09 architkulkarni

@Catch-Bull Possibly relevant failure? buildkite.com/ray-project/oss-ci-build-pr/builds/566#01837fbf-2e34-449b-8125-1834439b9e51/3305-3805

@architkulkarni I fixed the unit test again and it looks fine now.

Catch-Bull avatar Sep 28 '22 16:09 Catch-Bull

Looks like it's breaking the tests

  • linux://cpp:test_submit_cpp_job
  • osx://cpp:test_submit_cpp_job

https://buildkite.com/ray-project/oss-ci-build-branch/builds/287#01838f20-9bcf-4574-a706-843d085b2226

rickyyx avatar Sep 30 '22 18:09 rickyyx

The C++ fix might be simple: we just need to add support for address="auto" to C++ drivers, or simply ignore address="auto" when we see it (ignoring it may be enough, since it seemed to work before).

A second issue is that test_sdk became flaky on Linux and Mac.

Example: https://buildkite.com/ray-project/oss-ci-build-branch/builds/296#018391ff-6bde-4085-9405-d1f9e3f24f90/3377-3588

I will look into both of these but would greatly appreciate your help if you have any ideas! @Catch-Bull @SongGuyang

architkulkarni avatar Oct 03 '22 19:10 architkulkarni

Also, we should make sure the C++ job test runs whenever the Jobs code changes, even if there are no C++ changes.

architkulkarni avatar Oct 03 '22 19:10 architkulkarni

There's another type of flakiness: https://buildkite.com/ray-project/oss-ci-build-branch/builds/304#01839baf-cf05-4e32-a4b4-88608e7e9e0a

E         File "/Users/ec2-user/.buildkite-agent/builds/bk-mac1-branch-queue-i-0428bb49501460ad6-1/ray-project/oss-ci-build-branch/python/ray/dashboard/datacenter.py", line 234, in get_all_agent_infos
E           grpcPort=int(grpc_port),
E       TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
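
The TypeError above is what int() raises when it is handed None, which suggests an agent entry was read before its gRPC port had been registered. A minimal sketch of one way to guard against that follows; it is not the actual fix and not the real datacenter.py code. `collect_agent_infos` and its arguments are hypothetical, and skipping incomplete entries is an assumption about the desired behavior.

```python
from typing import Dict, Optional, Tuple


def collect_agent_infos(
    agents: Dict[str, Tuple[Optional[str], Optional[str]]],
    node_id_to_ip: Dict[str, str],
) -> Dict[str, dict]:
    """Build per-node agent info, skipping nodes whose ports are not known yet."""
    infos: Dict[str, dict] = {}
    for node_id, (http_port, grpc_port) in agents.items():
        # int(None) raises TypeError, so skip entries that are still incomplete.
        if http_port is None or grpc_port is None or node_id not in node_id_to_ip:
            continue
        ip = node_id_to_ip[node_id]
        infos[node_id] = {
            "ipAddress": ip,
            "httpPort": int(http_port),
            "grpcPort": int(grpc_port),
            "httpAddress": f"{ip}:{http_port}",
        }
    return infos
```

Whether to skip such entries or retry until the port appears depends on how callers use the result; skipping is just the simplest option to illustrate.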

architkulkarni avatar Oct 03 '22 23:10 architkulkarni