[RayService] Cross-language calls are not supported
What happened + What you expected to happen
Background
I want to deploy my Ray Serve applications using RayService. There are two applications: solution and pipeline_proxy. The solution application is defined in Python and deploys successfully. The pipeline_proxy application is also defined in Python, but its script needs to invoke a Java function to create some Ray Serve deployments from Java classes.
RayService CRD
serveConfigV2: |
  applications:
    - name: solution
      route_prefix: /solution
      import_path: app:app_builder
      runtime_env:
        working_dir: file:///solution.zip
    - name: pipeline_proxy
      import_path: pipeline_app:app_builder
      runtime_env:
        working_dir: file:///solution.zip
Problem
- How do I specify job_config.code_search_path? There is no such field in ServeDeploySchema. Without this setting, Ray raises:
Cross language feature needs --load-code-from-local to be set.
- I hard-coded job_config.code_search_path in worker.py, and that also raises an exception.
Setting job_config.code_search_path (worker.py)
if job_config is None:
    # job_config = ray.job_config.JobConfig()
    job_config = ray.job_config.JobConfig(code_search_path=["/mount/data/pipeline.jar"])
exception
(raylet) Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 2206, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 2086, in ray._raylet.execute_task_with_cancellation_handler
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/function_manager.py", line 384, in get_execution_info
if self._load_function_from_local(function_descriptor) is True:
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/function_manager.py", line 417, in _load_function_from_local
function = object._function
AttributeError: 'function' object has no attribute '_function'
An unexpected internal error occurred while the worker was executing a task.
(raylet) Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 2206, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 2086, in ray._raylet.execute_task_with_cancellation_handler
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/function_manager.py", line 384, in get_execution_info
if self._load_function_from_local(function_descriptor) is True:
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/function_manager.py", line 417, in _load_function_from_local
function = object._function
AttributeError: 'function' object has no attribute '_function'
Versions / Dependencies
Ray: 2.9.1 Python: 3.10.13
Reproduction script
There is an application defined in Python that invokes a Java function to create a Ray Serve deployment from a Java class.
RayService CRD
serveConfigV2: |
  applications:
    - name: solution
      route_prefix: /solution
      import_path: app:app_builder
      runtime_env:
        working_dir: file:///solution.zip
Issue Severity
High: It blocks me from completing my task.
Hi @anyscalesam could you help take a look?
Hi @chenzhengfei, we can have a meeting together with @yucai to discuss this. Thanks!
Sure, thanks a lot. @kevin85421
Hi @kevin85421, after some exploration, we found that this is caused by two sub-issues, and we have a proposed solution for you to review.
We have tested this solution in our environment and confirmed that it works.
Problem Identification
- The code_search_path parameter is essential for invoking Java actors or tasks. However, at present, there is no option to configure this parameter within the RayService CRD.
- Even if we attempt to work around this configuration issue, the deployment fails due to an unidentified exception.
Proposed Solution
Code Search Path Support
Currently, Ray supports the environment variable RAY_JOB_CONFIG_JSON_ENV_VAR for injecting runtime_env settings. However, code_search_path is not incorporated into this process, which leads to the first issue listed above.
Our solution is to merge code_search_path in the worker init function.
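To make the idea concrete, here is a minimal sketch of such a merge. It does not use Ray at all: a plain dict stands in for the job config, the helper name merge_code_search_path is hypothetical, and the assumption that the injected JSON carries a top-level "code_search_path" key is ours (the actual layout is defined by the PRs, which are not shown here).

```python
import json
import os

# Name of the environment variable mentioned above.
JOB_CONFIG_ENV_VAR = "RAY_JOB_CONFIG_JSON_ENV_VAR"


def merge_code_search_path(job_config: dict, environ=os.environ) -> dict:
    """Return a copy of job_config with code_search_path merged in from the env var.

    `job_config` is a plain dict standing in for ray.job_config.JobConfig.
    """
    merged = dict(job_config)
    raw = environ.get(JOB_CONFIG_ENV_VAR)
    if not raw:
        # Nothing injected: leave the config untouched.
        return merged
    injected = json.loads(raw)
    # Assumed key name; the real JSON layout is not shown in this thread.
    extra_paths = injected.get("code_search_path", [])
    existing = list(merged.get("code_search_path", []))
    # Append paths that are not already present, preserving order.
    for path in extra_paths:
        if path not in existing:
            existing.append(path)
    merged["code_search_path"] = existing
    return merged


# Example: the jar path from the report above, injected through the env var.
env = {JOB_CONFIG_ENV_VAR: json.dumps({"code_search_path": ["/mount/data/pipeline.jar"]})}
print(merge_code_search_path({"code_search_path": []}, env))
```

Merging rather than overwriting keeps any code_search_path the caller already set, which seems like the safer default for a worker init hook.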
Deployment Failure Bug Fix
The second issue is caused by set_load_code_from_local. The detailed path is as follows:
- When the code_search_path is specified, Ray automatically enables the load_code_from_local option to facilitate the loading of Java libraries.
if job_config and job_config.code_search_path:
    global_worker.set_load_code_from_local(True)
- After load_code_from_local is enabled, the function manager in Ray uses importlib.import_module to load the function.
However, Ray assumes the function is already wrapped by the @ray.remote decorator, and calls function = object._function to get the underlying Python function.
class FunctionActorManager:
    ........
    def get_execution_info(self, job_id, function_descriptor):
        .......
        if self._worker.load_code_from_local:
            # Load function from local code.
            if not function_descriptor.is_actor_method():
                # If the function is not able to be loaded,
                # try to load it from GCS,
                # even if load_code_from_local is set True
                if self._load_function_from_local(function_descriptor) is True:
                    return self._function_execution_info[function_id]
        .......

    def _load_function_from_local(self, function_descriptor):
        assert not function_descriptor.is_actor_method()
        function_id = function_descriptor.function_id
        module_name, function_name = (
            function_descriptor.module_name,
            function_descriptor.function_name,
        )
        object = self.load_function_or_class_from_local(module_name, function_name)
        if object is not None:
            function = object._function
            self._function_execution_info[function_id] = FunctionExecutionInfo(
                function=function,
                function_name=function_name,
                max_calls=0,
            )
            self._num_task_executions[function_id] = 0
            return True
        else:
            return False

    def load_function_or_class_from_local(self, module_name, function_or_class_name):
        """Try to load a function or class in the module from local."""
        module = importlib.import_module(module_name)
        parts = [part for part in function_or_class_name.split(".") if part]
        object = module
        try:
            for part in parts:
                object = getattr(object, part)
            return object
        except Exception:
            return None
- In Ray Serve, the _start_controller function is not decorated with @ray.remote at definition time. Instead, it is wrapped dynamically inside the serve_start_async function. This implies that:
load_function_or_class_from_local returns a plain Python function without the @ray.remote wrapper, so the assignment function = object._function fails: the code expects object to be a Ray remote function, but it is actually a plain Python function.
def _start_controller(
    http_options: Union[None, dict, HTTPOptions] = None,
    grpc_options: Union[None, dict, gRPCOptions] = None,
    global_logging_config: Union[None, dict, LoggingConfig] = None,
    **kwargs,
) -> Tuple[ActorHandle, str]:
    """Start Ray Serve controller.
    .......

async def serve_start_async(
    http_options: Union[None, dict, HTTPOptions] = None,
    grpc_options: Union[None, dict, gRPCOptions] = None,
    global_logging_config: Union[None, dict, LoggingConfig] = None,
    **kwargs,
) -> ServeControllerClient:
    .....
    controller, controller_name = (
        await ray.remote(_start_controller)
        .options(num_cpus=0)
        .remote(http_options, grpc_options, global_logging_config, **kwargs)
    )
    client = ServeControllerClient(
        controller,
        controller_name,
    )
    _set_global_client(client)
    logger.info(f'Started Serve in namespace "{SERVE_NAMESPACE}".')
    return client
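The failure can be reproduced in isolation without Ray. The sketch below copies the attribute walk from load_function_or_class_from_local into a standalone function and points it at json.dumps, which here stands in for any module-level function that was never decorated with @ray.remote (that substitution is our assumption for illustration, not something Ray does).

```python
import importlib


def load_function_or_class_from_local(module_name, function_or_class_name):
    """Same attribute walk as the Ray snippet above, reproduced standalone."""
    module = importlib.import_module(module_name)
    parts = [part for part in function_or_class_name.split(".") if part]
    obj = module
    try:
        for part in parts:
            obj = getattr(obj, part)
        return obj
    except Exception:
        return None


# json.dumps stands in for a plain function such as _start_controller.
loaded = load_function_or_class_from_local("json", "dumps")

error_message = None
try:
    # This mirrors `function = object._function` in _load_function_from_local.
    function = loaded._function
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

The message printed is the same AttributeError text that appears in the raylet traceback in the original report.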
Upon examining the source code, we discovered that for actors, there is a conditional check to determine whether the class is a pure Python class or a Ray actor class in _load_actor_class_from_local. Consequently, we have decided to implement a comparable check within the _load_function_from_local method. This will check whether the given function is a standard Python function or a Ray remote function, thereby resolving the issue.
def _load_actor_class_from_local(self, actor_creation_function_descriptor):
    """Load actor class from local code."""
    module_name, class_name = (
        actor_creation_function_descriptor.module_name,
        actor_creation_function_descriptor.class_name,
    )
    object = self.load_function_or_class_from_local(module_name, class_name)
    if object is not None:
        if isinstance(object, ray.actor.ActorClass):
            return object.__ray_metadata__.modified_class
        else:
            return object
    else:
        return None
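A comparable check for functions could look roughly like the sketch below. It is not the code from the PRs: RemoteFunctionStandIn and unwrap_function are hypothetical names used so the example runs without Ray. Inside Ray, the type to test against would presumably be ray.remote_function.RemoteFunction, which is our assumption about how the fix is wired in.

```python
class RemoteFunctionStandIn:
    """Hypothetical stand-in for Ray's remote function wrapper.

    Like the real wrapper, it keeps the original callable in `_function`.
    """

    def __init__(self, function):
        self._function = function


def unwrap_function(obj, remote_function_type=RemoteFunctionStandIn):
    """Return the underlying Python function for wrapped or plain input."""
    if isinstance(obj, remote_function_type):
        # Wrapped case: same access as `object._function` today.
        return obj._function
    # Plain case: the object already is the function to execute.
    return obj


def plain(x):
    return x + 1


wrapped = RemoteFunctionStandIn(plain)

# Both paths resolve to the same underlying function.
print(unwrap_function(plain) is plain)
print(unwrap_function(wrapped) is plain)
```

With a check of this shape in _load_function_from_local, a dynamically wrapped function such as _start_controller would be used as-is instead of raising AttributeError.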
Hi @kevin85421, I created two PRs for this fix. Could you help review them? Thanks.
Hi @anyscalesam , Could you please assist with the review at your convenience? Thank you!
@kevin85421 this is more the runtime env part of the codepath than KubeRay, right? Do you need help from @rynewang to review?
next review on monday... will let you know on next steps then..