local_docker scheduler hostname length exceeded
🐛 Bug
In DockerScheduler._submit_dryrun, the hostname keyword argument passed to docker.containers.run is set to name: https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L280
name is set to
name = f"{app_id}-{role.name}-{replica_id}"
https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L259C17-L260C1
It is typical/common for a component to use torchx StructuredNameArgument which has:
parse_from(
    name: str,
    m: Optional[str] = None,
    script: Optional[str] = None,
    default_experiment_name: str = "default-experiment",
)
The component then passes structured_name.run_name as the name argument of spec.AppDef, which gets used as the app_id.
If the app_id is long enough that the generated name exceeds the maximum allowed DNS hostname length (63 characters), the torchx run command fails with an error that is difficult to troubleshoot:
docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.41/containers/a932ov2fff/start: Internal Server Error ("failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: sethostname: invalid argument: unknown")
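The length arithmetic behind the failure can be checked without Docker or torchx. The sketch below is only an illustration: `container_name` and `exceeds_hostname_limit` are hypothetical helpers, the first mirrors the f-string quoted above, and the app_id and role name values are made up. The 63-character limit is the figure cited in this issue.

```python
# Limit cited in this issue for DNS hostnames.
MAX_HOSTNAME_LEN = 63


def container_name(app_id: str, role_name: str, replica_id: int) -> str:
    """Hypothetical helper mirroring the f-string quoted in this issue."""
    return f"{app_id}-{role_name}-{replica_id}"


def exceeds_hostname_limit(name: str) -> bool:
    """Return True when the name is longer than the cited hostname limit."""
    return len(name) > MAX_HOSTNAME_LEN


# Made-up values: a short app_id stays under the limit...
short_name = container_name("short-app", "trainer", 0)
# ...while a 60-character app_id pushes the generated name past it.
long_name = container_name("a" * 60, "trainer", 0)

print(len(short_name), exceeds_hostname_limit(short_name))
print(len(long_name), exceeds_hostname_limit(long_name))
```

A check like this before calling docker.containers.run would let the scheduler raise a clear error instead of surfacing the sethostname failure from the runtime.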
Module (check all that applies):
- [ ] torchx.spec
- [x] torchx.component
- [ ] torchx.apps
- [ ] torchx.runtime
- [ ] torchx.cli
- [x] torchx.schedulers
- [ ] torchx.pipelines
- [ ] torchx.aws
- [ ] torchx.examples
- [ ] other
torchx.component.structured_arg.py
torchx.scheduler.docker_scheduler
To Reproduce
Steps to reproduce the behavior:
- Use any component with the local_docker scheduler and a generated app name long enough to exceed the hostname limit (e.g., one derived from the entrypoint)
- Check terminal
Expected behavior
The hostname should be truncated to 63 characters, and/or the generated name should be shortened.
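One possible shape for the truncation is sketched below. This is not the torchx implementation: `shorten_hostname` is a hypothetical helper, and appending a short hash suffix is an assumption made here so that two long names sharing the same prefix do not collapse into the same hostname after truncation.

```python
import hashlib

# Limit cited in this issue for DNS hostnames.
MAX_HOSTNAME_LEN = 63


def shorten_hostname(name: str, max_len: int = MAX_HOSTNAME_LEN) -> str:
    """Hypothetical helper: return name unchanged if it fits, otherwise
    truncate it and append a short hash of the full name."""
    if len(name) <= max_len:
        return name
    # 8 hex characters of a SHA-1 digest keep truncated names distinct.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:8]
    # Reserve room for the digest and the separating hyphen.
    keep = max_len - len(digest) - 1
    # Drop trailing hyphens so the result never contains "--" at the join.
    return f"{name[:keep].rstrip('-')}-{digest}"


# Made-up names that share a 70-character prefix.
first = shorten_hostname("a" * 70 + "-trainer-0")
second = shorten_hostname("a" * 70 + "-trainer-1")
print(len(first), len(second), first != second)
```

Only the hostname would need this treatment; the container name itself could stay as the full generated name.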
Environment
- torchx version (e.g. 0.1.0rc1):
- Python version:
- OS (e.g., Linux):
- How you installed torchx (conda, pip, source, docker):
- Docker image and tag (if using docker):
- Git commit (if installed from source):
- Execution environment (on-prem, AWS, GCP, Azure etc):
- Any other relevant information: