
kubernetes_scheduler: volcano doesn't support Kubernetes 1.22

Open · 5had3z opened this issue 4 years ago • 11 comments

🐛 Bug

The Kubernetes hello-world example does not run: it returns a KeyError when requesting the job description, and I don't think the job ever launches (it does not appear in kubectl get jobs). There is also a typo in the example (torchx/schedulers/kubernetes_scheduler.py:ln245): it should be --scheduler_args, not --scheduler_opts.

kubectl get jobs -A
NAMESPACE        NAME                     COMPLETIONS   DURATION   AGE
volcano-system   volcano-admission-init   1/1           3s         115m

Module (check all that apply):

  • [ ] torchx.spec
  • [ ] torchx.component
  • [ ] torchx.apps
  • [ ] torchx.runtime
  • [x] torchx.cli
  • [x] torchx.schedulers
  • [ ] torchx.pipelines
  • [ ] torchx.aws
  • [x] torchx.examples
  • [ ] other

To Reproduce

Steps to reproduce the behavior:

  1. Install Kubernetes 1.22 and gpu-operator (following https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#step-0-before-you-begin) (I also added one extra node to the cluster)
  2. Install Volcano kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.3.0/installer/volcano-development.yaml
  3. git clone https://github.com/pytorch/torchx.git && cd torchx && sudo python3 -m pip install -e . (the Kubernetes scheduler is not in the PyPI release yet)
  4. torchx run --scheduler kubernetes --scheduler_args namespace=default,queue=test utils.echo --msg hello
torchx run --scheduler kubernetes --scheduler_args namespace=default,queue=test utils.echo --msg hello
kubernetes://torchx_monash/default:echo-8mgdh
=== RUN RESULT ===
Launched app: kubernetes://torchx_monash/default:echo-8mgdh
Traceback (most recent call last):
  File "/usr/local/bin/torchx", line 33, in <module>
    sys.exit(load_entry_point('torchx', 'console_scripts', 'torchx')())
  File "/home/monash/torchx/torchx/cli/main.py", line 62, in main
    args.func(args)
  File "/home/monash/torchx/torchx/cli/cmd_run.py", line 120, in run
    status = runner.status(app_handle)
  File "/home/monash/torchx/torchx/runner/api.py", line 294, in status
    desc = scheduler.describe(app_id)
  File "/home/monash/torchx/torchx/schedulers/kubernetes_scheduler.py", line 342, in describe
    status = resp["status"]
KeyError: 'status'

Expected behavior

The command should return without error, and the status or description of the job should be retrievable.

Environment

  • torchx version (e.g. 0.1.0rc1): master
  • Python version: 3.8.10
  • OS (e.g., Linux): Ubuntu 20.04
  • How you installed torchx (conda, pip, source, docker):
  • Docker image and tag (if using docker):
  • Git commit (if installed from source): 037e716a368a2346173fdbcc9b0d879fc5e556af
  • Execution environment (on-prem, AWS, GCP, Azure etc): on-prem
  • Any other relevant information:

Additional context

My use case is 4 workstations with 2 GPUs each, doing distributed or shared training amongst a small university research group. I'm trying to get this hello-world working before I start trying to run my distributed code, which I already have working on a single-node system with torch.distributed.run --standalone (args...).

5had3z avatar Aug 07 '21 06:08 5had3z

Inserting the following lines at torchx/schedulers/kubernetes_scheduler.py:ln343, just before status = resp['status']:

for key_, obj_ in resp.items():
    print(f"{key_}: {obj_}")

Output is below:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata: {'creationTimestamp': '2021-08-07T06:56:36Z', 'generateName': 'echo-', 'generation': 1, 'managedFields': [{'apiVersion': 'batch.volcano.sh/v1alpha1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:generateName': {}}, 'f:spec': {'.': {}, 'f:maxRetry': {}, 'f:plugins': {'.': {}, 'f:env': {}, 'f:svc': {}}, 'f:queue': {}, 'f:schedulerName': {}, 'f:tasks': {}}}, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': '2021-08-07T06:56:36Z'}], 'name': 'echo-5rksp', 'namespace': 'default', 'resourceVersion': '475515', 'uid': 'b178cdba-7773-4e8d-8893-8727c628b1a3'}
spec: {'maxRetry': 0, 'plugins': {'env': [], 'svc': []}, 'queue': 'test', 'schedulerName': 'volcano', 'tasks': [{'maxRetry': 0, 'name': 'echo-0', 'policies': [{'action': 'RestartJob', 'event': 'PodEvicted'}, {'action': 'RestartJob', 'event': 'PodFailed'}], 'replicas': 1, 'template': {'spec': {'containers': [{'command': ['/bin/echo', 'hello'], 'env': [], 'image': '/tmp', 'name': 'echo-0', 'ports': [], 'resources': {'limits': {}, 'requests': {}}}], 'restartPolicy': 'Never'}}}]}

So the call at ln335 is just returning the Kubernetes Job configuration rather than its status? Note that there is no "status" key at all in the dump above.
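For reference, the same object can be pulled by hand with the official kubernetes Python client (a sketch only: the group/version/plural values and the job name echo-5rksp are taken from the dump above, and the exact call torchx makes internally is an assumption on my part):

from kubernetes import client, config

config.load_kube_config()

api = client.CustomObjectsApi()
job = api.get_namespaced_custom_object(
    group="batch.volcano.sh",   # from apiVersion in the dump above
    version="v1alpha1",
    namespace="default",
    plural="jobs",
    name="echo-5rksp",          # from metadata['name'] in the dump above
)

# A job that was admitted by the volcano scheduler carries a "status" block;
# here it is absent, which is exactly what trips the KeyError in describe().
print(job.get("status", "<no status on this Job object>"))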

5had3z avatar Aug 07 '21 07:08 5had3z

Hi, thanks for reporting this! I'll take a look and see if I can reproduce it

d4l3k avatar Aug 09 '21 18:08 d4l3k

I ran into this issue when I created a cluster with no workers, but once I created the workers it seemed to be fine. We've tested this on 1.18 and 1.21. I don't have access to a 1.22 cluster, but I'll try to spin up a local one on my laptop.

Can you install vcctl and send me the output from:

$ vcctl queue list
$ vcctl job list
$ kubectl get job.batch.volcano.sh/<jobid> -o yaml

This worked for me:

$ eksctl create cluster \
  --name torchx-dev-1-21 \
  --version 1.21 \
  --with-oidc \
  --without-nodegroup
            
$ eksctl create nodegroup \
  --cluster torchx-dev-1-21 \
  --name torchx-dev-1-21-workers \
  --node-type t3.medium \
  --nodes 3 \
  --nodes-min 1 \
  --nodes-max 4 \
  --ssh-access \
  --ssh-public-key <key>
  
$ kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.3.0/installer/volcano-development.yaml
                                 
$ vcctl queue create test
$ torchx run --scheduler kubernetes --scheduler_args queue=test utils.echo --image alpine:latest --msg hello
kubernetes://torchx_tristanr/default:echo-dhbfd
=== RUN RESULT ===
Launched app: kubernetes://torchx_tristanr/default:echo-dhbfd
AppStatus:
  msg: <NONE>
  num_restarts: -1
  roles: []
  state: PENDING (2)
  structured_error_msg: <NONE>
  ui_url: null

Job URL: None
$ torchx log kubernetes://torchx_tristanr/default:echo-dhbfd/echo
echo/0 2021-08-09T20:57:01.479780331Z hello

d4l3k avatar Aug 09 '21 20:08 d4l3k

Looks like there's a compatibility issue between Volcano v1.3.0 and Kubernetes v1.22: the v1beta1 PriorityClass API it watches was removed in 1.22.

https://kubernetes.io/docs/reference/using-api/deprecation-guide/#priorityclass-v122
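A quick way to confirm this (a sketch using the kubernetes Python client, roughly equivalent to checking kubectl api-versions; the exact client classes are an assumption) is to list which scheduling.k8s.io versions the API server still serves. On 1.22, v1beta1 is gone, which is what the reflector errors below are complaining about:

from kubernetes import client, config

config.load_kube_config()

# Volcano v1.3.0 watches v1beta1 PriorityClass, which Kubernetes 1.22 removed.
group = client.SchedulingApi().get_api_group()
print([v.version for v in group.versions])  # no "v1beta1" on a 1.22 cluster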

tristanr@tristanr-arch2 ~> kubectl logs --namespace volcano-system pods/volcano-scheduler-5665cdc4d9-cv5kx
W0809 21:56:49.894796       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
E0809 21:56:49.920722       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
I0809 21:56:49.931104       1 event_handlers.go:199] Added pod <volcano-system/volcano-admission-init--1-srdfl> into cache.
I0809 21:56:49.931138       1 event_handlers.go:199] Added pod <volcano-system/volcano-controllers-5bbcc9c49f-fq5fl> into cache.
I0809 21:56:49.931147       1 event_handlers.go:199] Added pod <volcano-system/volcano-scheduler-5665cdc4d9-cv5kx> into cache.
I0809 21:56:49.931158       1 event_handlers.go:199] Added pod <kube-system/kube-controller-manager-minikube> into cache.
I0809 21:56:49.931171       1 event_handlers.go:199] Added pod <kube-system/kube-apiserver-minikube> into cache.
I0809 21:56:49.931178       1 event_handlers.go:199] Added pod <kube-system/storage-provisioner> into cache.
I0809 21:56:49.931185       1 event_handlers.go:199] Added pod <kube-system/coredns-78fcd69978-hrs6m> into cache.
I0809 21:56:49.931191       1 event_handlers.go:199] Added pod <kube-system/etcd-minikube> into cache.
I0809 21:56:49.931199       1 event_handlers.go:199] Added pod <kube-system/kube-scheduler-minikube> into cache.
I0809 21:56:49.931212       1 event_handlers.go:199] Added pod <kube-system/kube-proxy-4kkmq> into cache.
I0809 21:56:49.931231       1 event_handlers.go:199] Added pod <volcano-system/volcano-admission-5bb77cd5b7-zxqf9> into cache.
E0809 21:56:51.279727       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:56:53.375096       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:56:58.251774       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:57:09.886813       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:57:25.430937       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:58:00.806485       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:58:40.374955       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource

d4l3k avatar Aug 09 '21 22:08 d4l3k

Yes, I was just in the process of commenting that it seems like things are running, but as soon as I peeked behind the curtain at the logs I saw that error. I'm giving up on 1.22; ElasticOperator, Volcano, and PyTorchOperator are all using ~~deprecated~~ now-removed APIs. I'm going into uni today to reset all the workstations and load on a fresh 1.21.3, and hopefully get at least one of them running so I can get back to actual research.

5had3z avatar Aug 09 '21 22:08 5had3z

I think the path forward here is:

  1. make the torchx kubernetes scheduler robust to a missing status and show an "UNKNOWN" status for the job (see the sketch after this list)
  2. file an issue on Volcano for 1.22 compatibility
  3. update torchx documentation to show compatible versions
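
For (1), here's a rough sketch of the kind of guard I mean (AppState comes from torchx/specs/api.py as linked below; the mapping in the sketch is illustrative, not the actual implementation):

from torchx.specs.api import AppState

def job_state_from_resp(resp: dict) -> AppState:
    status = resp.get("status")
    if not status:
        # The Volcano Job object exists but has no status yet (e.g. the
        # scheduler never admitted it), so report UNKNOWN instead of
        # raising KeyError in describe().
        return AppState.UNKNOWN
    # otherwise map the Volcano phase (status["state"]["phase"]) to an
    # AppState the way describe() already does for healthy jobs
    return AppState.PENDING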

d4l3k avatar Aug 09 '21 22:08 d4l3k

Unless there is a more proper way to check whether the job has launched successfully, I think this could be treated as an indirect way of determining that the job didn't start for whatever reason.

try:
    status = resp["status"]
except KeyError:
    raise RuntimeError("Failed to retrieve status, job possibly didn't start???")

5had3z avatar Aug 09 '21 22:08 5had3z

The method throwing this error is describe, which just describes the job, so it doesn't make a ton of sense to throw that error there. Possibly a warning? But that's a bit clunky.

We do have an UNKNOWN appstate that would be a good fit for this https://github.com/pytorch/torchx/blob/master/torchx/specs/api.py#L316

If we do want to throw an error, it might be better to add something in the CLI run command instead when the status comes back unknown.
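
Something roughly like this (a sketch only; the AppStatus shape and where it hooks into cmd_run.py are assumptions):

from torchx.specs.api import AppState

def warn_if_unknown(status) -> None:
    # Called right after runner.status(app_handle) in the run command:
    # if the scheduler couldn't resolve a state, tell the user the job
    # may never have started instead of failing or staying silent.
    if status is not None and status.state == AppState.UNKNOWN:
        print(
            "Warning: job status is UNKNOWN, the job may not have started. "
            "Check kubectl get jobs.batch.volcano.sh -A and the volcano "
            "scheduler logs."
        )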

d4l3k avatar Aug 09 '21 23:08 d4l3k

Filed https://github.com/volcano-sh/volcano/issues/1665

d4l3k avatar Aug 10 '21 00:08 d4l3k

This is still an outstanding issue with Volcano v1.4.

d4l3k avatar Nov 02 '21 21:11 d4l3k

This might be fixed with Volcano v1.5 beta, but I haven't tested it. The Volcano issue is still open. https://github.com/volcano-sh/volcano/releases/tag/v1.5.0-Beta

d4l3k avatar Feb 01 '22 22:02 d4l3k

@d4l3k is this no longer an issue?

tiagovrtr avatar Nov 29 '22 12:11 tiagovrtr

Yes, this is fixed on the Volcano side with the newer Volcano releases.

d4l3k avatar Nov 29 '22 16:11 d4l3k