[Job Submission][refactor 5/N] Remove the head node dependency on the `Raylet` process
Signed-off-by: Catch-Bull [email protected]
Why are these changes needed?
Move all interfaces that depend on the raylet process to JobAgent. After this PR, the head node no longer needs to start the raylet process.
Main content:
- Remove `JobManager`: all interfaces that require `JobManager` are implemented by calling `JobAgentSubmissionClient`.
- Delete `init_ray_and_catch_exceptions` from `JobHead`; `JobHead` will not call `ray.init`.
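To illustrate the delegation idea, here is a minimal, self-contained sketch rather than the actual code in this PR. `FakeAgentClient` and `HeadSketch` are hypothetical stand-ins (they are not Ray classes), and the round-robin agent selection is an assumption made only for this example. The point is that the head-side handler forwards each request to an agent client and never needs a local raylet or `ray.init`.

```python
import asyncio
from typing import Dict, List


class FakeAgentClient:
    """Hypothetical stand-in for a per-node agent client (not Ray's real class)."""

    def __init__(self, address: str) -> None:
        self.address = address
        self.jobs: Dict[str, str] = {}

    async def submit_job(self, entrypoint: str) -> str:
        # The agent, not the head, owns the job and its bookkeeping.
        job_id = f"job-{len(self.jobs) + 1}@{self.address}"
        self.jobs[job_id] = entrypoint
        return job_id


class HeadSketch:
    """Hypothetical head-side handler that only forwards requests to agents."""

    def __init__(self, agents: List[FakeAgentClient]) -> None:
        self._agents = agents
        self._next = 0

    def _pick_agent(self) -> FakeAgentClient:
        # Round-robin selection is an assumption for this sketch only.
        agent = self._agents[self._next % len(self._agents)]
        self._next += 1
        return agent

    async def submit_job(self, entrypoint: str) -> str:
        # Forward the request instead of running a local job manager.
        return await self._pick_agent().submit_job(entrypoint)


async def main() -> List[str]:
    head = HeadSketch([FakeAgentClient("node-a"), FakeAgentClient("node-b")])
    return [await head.submit_job("echo hi") for _ in range(3)]


job_ids = asyncio.run(main())
print(job_ids)
```

In this shape the head process holds no job state of its own, which is what makes it possible to drop the raylet from the head node.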
Related issue number
Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
@Catch-Bull could you give the PR a better name and description? It's really hard to tell what this one is doing just from the name.
Some CI job tests failed, e.g. https://buildkite.com/ray-project/oss-ci-build-pr/builds/431#01837a6f-fb4c-4e09-97e4-c9a57fdb10c9/3394-4825 , could you please take a look?
I tried to fix it, and now all three tests can finish on my machine
@Catch-Bull Possibly relevant failure? https://buildkite.com/ray-project/oss-ci-build-pr/builds/566#01837fbf-2e34-449b-8125-1834439b9e51/3305-3805
@architkulkarni I fixed the UT again and it looks fine now.
Looks like it's breaking the tests
- linux://cpp:test_submit_cpp_job
- osx://cpp:test_submit_cpp_job
https://buildkite.com/ray-project/oss-ci-build-branch/builds/287#01838f20-9bcf-4574-a706-843d085b2226
The cpp fix might be simple, we just need to add support for address="auto" to C++ drivers, or simply ignore address="auto" if we see it (maybe we can just ignore it, because it seemed to be working before.)
A second issue is that test_sdk became flaky on Linux and Mac:

Example: https://buildkite.com/ray-project/oss-ci-build-branch/builds/296#018391ff-6bde-4085-9405-d1f9e3f24f90/3377-3588
I will look into both of these but would greatly appreciate your help if you have any ideas! @Catch-Bull @SongGuyang
Also, we should make sure the C++ job test runs whenever the Jobs code changes, even if there are no C++ changes.
There's another type of flakiness: https://buildkite.com/ray-project/oss-ci-build-branch/builds/304#01839baf-cf05-4e32-a4b4-88608e7e9e0a
E   File "/Users/ec2-user/.buildkite-agent/builds/bk-mac1-branch-queue-i-0428bb49501460ad6-1/ray-project/oss-ci-build-branch/python/ray/dashboard/datacenter.py", line 234, in get_all_agent_infos
E     grpcPort=int(grpc_port),
E TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
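For context, this `TypeError` is what `int()` raises when it is handed `None`, which suggests `grpc_port` had not been populated yet when `get_all_agent_infos` ran. Below is a minimal sketch of one possible guard, assuming it is acceptable to treat a missing port as "not reported yet". The helper `parse_grpc_port` is hypothetical and is not part of `datacenter.py`; whether the real fix should skip the agent, retry, or wait is a separate question.

```python
from typing import Optional


def parse_grpc_port(raw_port: Optional[str]) -> Optional[int]:
    """Return the port as an int, or None if no port has been reported yet.

    Hypothetical helper for illustration only.
    """
    if raw_port is None:
        return None
    return int(raw_port)


# Reproduce the failure mode from the traceback: int(None) raises TypeError.
try:
    int(None)
    unguarded_failed = False
except TypeError:
    unguarded_failed = True

# The guarded version returns None instead of raising.
missing = parse_grpc_port(None)
present = parse_grpc_port("50051")
print(unguarded_failed, missing, present)
```

A caller would then need to decide what to do with a `None` port, for example leaving that agent out of the result until its port is known.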