kubernetes_scheduler: volcano doesn't support Kubernetes 1.22
🐛 Bug
The Kubernetes hello-world example does not run: it raises a KeyError when requesting the job description, and the job never appears to launch (it does not show up in `kubectl get jobs`). There is also a typo in the example (torchx/schedulers/kubernetes_scheduler.py:ln245): it should be `--scheduler_args`, not `--scheduler_opts`.
kubectl get jobs -A
NAMESPACE NAME COMPLETIONS DURATION AGE
volcano-system volcano-admission-init 1/1 3s 115m
Module (check all that applies):
- [ ] torchx.spec
- [ ] torchx.component
- [ ] torchx.apps
- [ ] torchx.runtime
- [x] torchx.cli
- [x] torchx.schedulers
- [ ] torchx.pipelines
- [ ] torchx.aws
- [x] torchx.examples
- [ ] other
To Reproduce
Steps to reproduce the behavior:
- Install Kubernetes 1.22 and gpu-operator (following https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#step-0-before-you-begin) (I also added one extra node to the cluster)
- Install Volcano: kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.3.0/installer/volcano-development.yaml
- git clone https://github.com/pytorch/torchx.git && cd torchx && sudo python3 -m pip install -e . (Kubernetes support is not on PyPI yet)
- torchx run --scheduler kubernetes --scheduler_args namespace=default,queue=test utils.echo --msg hello
torchx run --scheduler kubernetes --scheduler_args namespace=default,queue=test utils.echo --msg hello
kubernetes://torchx_monash/default:echo-8mgdh
=== RUN RESULT ===
Launched app: kubernetes://torchx_monash/default:echo-8mgdh
Traceback (most recent call last):
  File "/usr/local/bin/torchx", line 33, in <module>
    sys.exit(load_entry_point('torchx', 'console_scripts', 'torchx')())
  File "/home/monash/torchx/torchx/cli/main.py", line 62, in main
    args.func(args)
  File "/home/monash/torchx/torchx/cli/cmd_run.py", line 120, in run
    status = runner.status(app_handle)
  File "/home/monash/torchx/torchx/runner/api.py", line 294, in status
    desc = scheduler.describe(app_id)
  File "/home/monash/torchx/torchx/schedulers/kubernetes_scheduler.py", line 342, in describe
    status = resp["status"]
KeyError: 'status'
Expected behavior
The run returns without error and the status or description of the job can be retrieved.
Environment
- torchx version (e.g. 0.1.0rc1): master
- Python version: 3.8.10
- OS (e.g., Linux): Ubuntu 20.04
- How you installed torchx (conda, pip, source, docker):
- Docker image and tag (if using docker):
- Git commit (if installed from source): 037e716a368a2346173fdbcc9b0d879fc5e556af
- Execution environment (on-prem, AWS, GCP, Azure etc): on-prem
- Any other relevant information:
Additional context
My use case is 4 workstations with 2 GPUs each, used for distributed or shared training amongst a small university research group. I'm trying to get this hello-world working before I start running my distributed code, which already works on a single-node system with torch.distributed.run --standalone (args...).
Inserting the following lines at torchx/schedulers/kubernetes_scheduler.py:ln343, just before status = resp['status']:
for key_, obj_ in resp.items():
    print(f"{key_}: {obj_}")
Output is below:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata: {'creationTimestamp': '2021-08-07T06:56:36Z', 'generateName': 'echo-', 'generation': 1, 'managedFields': [{'apiVersion': 'batch.volcano.sh/v1alpha1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:generateName': {}}, 'f:spec': {'.': {}, 'f:maxRetry': {}, 'f:plugins': {'.': {}, 'f:env': {}, 'f:svc': {}}, 'f:queue': {}, 'f:schedulerName': {}, 'f:tasks': {}}}, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': '2021-08-07T06:56:36Z'}], 'name': 'echo-5rksp', 'namespace': 'default', 'resourceVersion': '475515', 'uid': 'b178cdba-7773-4e8d-8893-8727c628b1a3'}
spec: {'maxRetry': 0, 'plugins': {'env': [], 'svc': []}, 'queue': 'test', 'schedulerName': 'volcano', 'tasks': [{'maxRetry': 0, 'name': 'echo-0', 'policies': [{'action': 'RestartJob', 'event': 'PodEvicted'}, {'action': 'RestartJob', 'event': 'PodFailed'}], 'replicas': 1, 'template': {'spec': {'containers': [{'command': ['/bin/echo', 'hello'], 'env': [], 'image': '/tmp', 'name': 'echo-0', 'ports': [], 'resources': {'limits': {}, 'requests': {}}}], 'restartPolicy': 'Never'}}}]}
So the call at ln335 is just returning the Kubernetes Job configuration rather than its status?
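For what it's worth, fetching the job object directly shows the same thing. This is just a sketch, assuming the kubernetes Python client and a working kubeconfig; the job name echo-5rksp is taken from the dump above:

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Fetch the raw Volcano Job custom object (presumably the same resource that
# describe() is reading at ln335).
resp = api.get_namespaced_custom_object(
    group="batch.volcano.sh",
    version="v1alpha1",
    namespace="default",
    plural="jobs",
    name="echo-5rksp",  # name taken from the metadata dump above
)
print("status" in resp)  # False here: the controller never populated .status
```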
Hi, thanks for reporting this! I'll take a look and see if I can reproduce it.
I ran into this issue when I created a cluster with no workers, but once I added the workers it seemed to be fine. We've tested this on 1.18 and 1.21. I don't have access to a 1.22 cluster, but I'll try to spin up a local one on my laptop.
Can you install vcctl and send me the output from:
$ vcctl queue list
$ vcctl job list
$ kubectl get job.batch.volcano.sh/<jobid> -o yaml
This worked for me:
$ eksctl create cluster \
--name torchx-dev-1-21 \
--version 1.21 \
--with-oidc \
--without-nodegroup
$ eksctl create nodegroup \
--cluster torchx-dev-1-21 \
--name torchx-dev-1-21-workers \
--node-type t3.medium \
--nodes 3 \
--nodes-min 1 \
--nodes-max 4 \
--ssh-access \
--ssh-public-key <key>
$ kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.3.0/installer/volcano-development.yaml
$ vcctl queue create test
$ torchx run --scheduler kubernetes --scheduler_args queue=test utils.echo --image alpine:latest --msg hello
kubernetes://torchx_tristanr/default:echo-dhbfd
=== RUN RESULT ===
Launched app: kubernetes://torchx_tristanr/default:echo-dhbfd
AppStatus:
msg: <NONE>
num_restarts: -1
roles: []
state: PENDING (2)
structured_error_msg: <NONE>
ui_url: null
Job URL: None
$ torchx log kubernetes://torchx_tristanr/default:echo-dhbfd/echo
echo/0 2021-08-09T20:57:01.479780331Z hello
Looks like there's a compatibility issue between Volcano v1.3.0 and Kubernetes v1.22
https://kubernetes.io/docs/reference/using-api/deprecation-guide/#priorityclass-v122
tristanr@tristanr-arch2 ~> kubectl logs --namespace volcano-system pods/volcano-scheduler-5665cdc4d9-cv5kx
W0809 21:56:49.894796 1 client_config.go:608] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
E0809 21:56:49.920722 1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
I0809 21:56:49.931104 1 event_handlers.go:199] Added pod <volcano-system/volcano-admission-init--1-srdfl> into cache.
I0809 21:56:49.931138 1 event_handlers.go:199] Added pod <volcano-system/volcano-controllers-5bbcc9c49f-fq5fl> into cache.
I0809 21:56:49.931147 1 event_handlers.go:199] Added pod <volcano-system/volcano-scheduler-5665cdc4d9-cv5kx> into cache.
I0809 21:56:49.931158 1 event_handlers.go:199] Added pod <kube-system/kube-controller-manager-minikube> into cache.
I0809 21:56:49.931171 1 event_handlers.go:199] Added pod <kube-system/kube-apiserver-minikube> into cache.
I0809 21:56:49.931178 1 event_handlers.go:199] Added pod <kube-system/storage-provisioner> into cache.
I0809 21:56:49.931185 1 event_handlers.go:199] Added pod <kube-system/coredns-78fcd69978-hrs6m> into cache.
I0809 21:56:49.931191 1 event_handlers.go:199] Added pod <kube-system/etcd-minikube> into cache.
I0809 21:56:49.931199 1 event_handlers.go:199] Added pod <kube-system/kube-scheduler-minikube> into cache.
I0809 21:56:49.931212 1 event_handlers.go:199] Added pod <kube-system/kube-proxy-4kkmq> into cache.
I0809 21:56:49.931231 1 event_handlers.go:199] Added pod <volcano-system/volcano-admission-5bb77cd5b7-zxqf9> into cache.
E0809 21:56:51.279727 1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:56:53.375096 1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:56:58.251774 1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:57:09.886813 1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:57:25.430937 1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:58:00.806485 1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:58:40.374955 1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
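To double-check that this is the 1.22 API removal rather than something cluster-specific, you can list which versions of scheduling.k8s.io the API server still serves. A minimal sketch, assuming the kubernetes Python client and a working kubeconfig:

```python
from kubernetes import client, config

config.load_kube_config()

# List the versions served for the scheduling.k8s.io API group.
groups = client.ApisApi().get_api_versions().groups
sched = next(g for g in groups if g.name == "scheduling.k8s.io")
print([v.version for v in sched.versions])
# On <=1.21 this still includes "v1beta1"; on 1.22 only "v1" remains, which is
# why the v1beta1 PriorityClass watch in the Volcano v1.3.0 scheduler fails.
```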
Yes, I was just in the process of commenting that it seems like things are running, but as soon as I peeked behind the curtain at the logs I saw that error. I'm giving up on 1.22; ElasticOperator, Volcano, and PytorchOperator are all using ~~deprecated~~ now-removed APIs. I'm going into uni today to reset all the workstations and load a fresh 1.21.3, and hopefully get at least one of them running so I can get back to actual research.
I think the path forward here is:
- make torchx kubernetes scheduler robust to missing status and show "UNKNOWN" status for the job
- file an issue on Volcano for 1.22 compatibility
- update torchx documentation to show compatible versions
Unless there is a more proper way to check whether the job has launched successfully, I think this could be treated as an indirect signal that the job didn't start successfully for whatever reason.
try:
    status = resp["status"]
except KeyError:
    raise RuntimeError("Failed to retrieve status, job possibly didn't start???")
The method throwing this error is describe, which just describes the job, so it doesn't make a ton of sense to throw that error there. Possibly a warning? But that's a bit clunky.
We do have an UNKNOWN appstate that would be a good fit for this https://github.com/pytorch/torchx/blob/master/torchx/specs/api.py#L316
If we do want to throw an error, it might be better to add something to the CLI run command instead when the status is unknown.
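Something like this could work for the describe path. A rough sketch only, assuming AppState from the linked specs module; the Volcano phase names and the mapping below are illustrative, not taken from the actual scheduler code:

```python
from typing import Any, Dict

from torchx.specs.api import AppState

# Illustrative mapping from Volcano job phases to torchx AppStates.
JOB_STATE: Dict[str, AppState] = {
    "Pending": AppState.PENDING,
    "Running": AppState.RUNNING,
    "Completed": AppState.SUCCEEDED,
    "Failed": AppState.FAILED,
}

def job_state_from_resp(resp: Dict[str, Any]) -> AppState:
    # Volcano only populates .status once the job has been admitted, so a
    # missing key means "not known yet" rather than a hard error.
    status = resp.get("status")
    if status is None:
        return AppState.UNKNOWN
    phase = status.get("state", {}).get("phase")
    return JOB_STATE.get(phase, AppState.UNKNOWN)
```

That keeps describe() total over whatever the API returns, and the CLI can then decide whether an UNKNOWN state right after submission should be surfaced as a warning or an error.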
Filed https://github.com/volcano-sh/volcano/issues/1665
This is still an outstanding issue with Volcano v1.4.
This might be fixed with Volcano v1.5 beta, but I haven't tested it. The Volcano issue is still open. https://github.com/volcano-sh/volcano/releases/tag/v1.5.0-Beta
@d4l3k is this no longer an issue?
Yes, this is fixed on the Volcano side with the newer Volcano releases.