
local_docker scheduler hostname length exceeded

Open ryxli opened this issue 1 year ago • 0 comments

🐛 Bug

In DockerScheduler._submit_dryrun, the hostname keyword argument passed to docker.containers.run is set to name: https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L280

name is set to

name = f"{app_id}-{role.name}-{replica_id}"

https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L259C17-L260C1

It is common for a component to use torchx's StructuredNameArgument, whose parse_from method has the signature:

parse_from(
        name: str,
        m: Optional[str] = None,
        script: Optional[str] = None,
        default_experiment_name: str = "default-experiment",
    ) 

The component then passes structured_name.run_name as the name argument of spec.AppDef, which gets used as the app_id.

If the app_id is long enough that the generated name exceeds the maximum allowed DNS hostname length (63 characters), the torchx run command fails with an error that is difficult to troubleshoot:

docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.41/containers/a932ov2fff/start: Internal Server Error ("failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: sethostname: invalid argument: unknown")
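To make the length problem concrete, here is a minimal sketch that reproduces only the name formatting quoted above. The app_id, role name, and replica id are made-up values, and container_name is a hypothetical helper, not a torchx function.

```python
# Only the f-string format below comes from docker_scheduler.py;
# every value here is a hypothetical example.
MAX_HOSTNAME_LEN = 63  # limit cited in this issue


def container_name(app_id: str, role_name: str, replica_id: int) -> str:
    """Mirror the scheduler's name format: app id, role name, replica id."""
    return f"{app_id}-{role_name}-{replica_id}"


# A descriptive run name quickly uses up most of the budget.
app_id = "my-long-experiment-name-with-descriptive-run-name-abc123xyz"
name = container_name(app_id, "trainer", 0)

# The role name and replica id add further characters on top of the app id.
too_long = len(name) > MAX_HOSTNAME_LEN
print(f"{len(name)} characters, exceeds limit: {too_long}")
```

Because the role name and replica id are appended after the app_id, an app_id that is itself under the limit can still produce a hostname that is over it.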

Module (check all that applies):

  • [ ] torchx.spec
  • [x] torchx.component
  • [ ] torchx.apps
  • [ ] torchx.runtime
  • [ ] torchx.cli
  • [x] torchx.schedulers
  • [ ] torchx.pipelines
  • [ ] torchx.aws
  • [ ] torchx.examples
  • [ ] other

torchx.components.structured_arg

torchx.schedulers.docker_scheduler

To Reproduce

Steps to reproduce the behavior:

  1. Run any component with the local_docker scheduler using a sufficiently long generated app name (e.g., one derived from the entrypoint)
  2. Check terminal

Expected behavior

Either the hostname should be truncated to 63 characters, or the generated name should be shortened (or both).
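One possible shape for the truncation option is sketched below. This is an assumption about how a fix could look, not the torchx implementation: safe_hostname is a hypothetical helper, and appending a short hash is one way to keep truncated names distinct when they differ only in their tail (for example, in the replica id).

```python
import hashlib

MAX_HOSTNAME_LEN = 63  # limit cited in this issue


def safe_hostname(name: str, max_len: int = MAX_HOSTNAME_LEN) -> str:
    """Return name unchanged if it fits; otherwise truncate it and
    append a short hash of the full name so results stay distinct."""
    if len(name) <= max_len:
        return name
    # 8 hex characters derived from the full, untruncated name.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:8]
    # Reserve room for "-" plus the digest; avoid a trailing "-" or ".".
    prefix = name[: max_len - len(digest) - 1].rstrip("-.")
    return f"{prefix}-{digest}"


# Hypothetical over-long container names differing only in replica id.
long_a = "x" * 70 + "-trainer-0"
long_b = "x" * 70 + "-trainer-1"
host_a = safe_hostname(long_a)
host_b = safe_hostname(long_b)
```

Under this sketch the container name itself could stay as it is, with only the value passed as hostname going through the helper. Plain truncation without the hash would map both replicas above to the same hostname.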

Environment

  • torchx version (e.g. 0.1.0rc1):
  • Python version:
  • OS (e.g., Linux):
  • How you installed torchx (conda, pip, source, docker):
  • Docker image and tag (if using docker):
  • Git commit (if installed from source):
  • Execution environment (on-prem, AWS, GCP, Azure etc):
  • Any other relevant information:

Additional context

ryxli, Feb 14 '24 22:02