
Hub and Proxy running but getting 502 Bad gateway

ClementGautier opened this issue

Describe the bug:

It looks like the proxy sometimes loses its connection to the hub, and we need to kill the proxy pod to force its recreation.

Expected behaviour:

I expect the application to be reachable even after a restart of the hub.

Steps to reproduce the issue:

1. ~~helm install~~
2. ~~kubectl delete pod hub-***~~
3. ~~you should see 502 bad gateway even after the hub pod is shown as running~~

EDIT: rebooting the node seems to be the only way to reproduce the issue consistently.

Possible Fix:

A quick fix might be to put liveness probes on the proxy pods to ensure the connection still exists, but there might be a better fix.
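For illustration, a probe along these lines could be added to the proxy container in the chart's deployment template. This is only a sketch: it assumes the proxy is configurable-http-proxy listening on port 8000 and exposing its `/_chp_healthz` health endpoint there.

```yaml
# Hypothetical liveness probe for the proxy container (sketch only).
# Assumes configurable-http-proxy on port 8000 with its /_chp_healthz endpoint.
containers:
  - name: proxy
    # ...image, args, etc. as templated by the chart...
    livenessProbe:
      httpGet:
        path: /_chp_healthz
        port: 8000
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3
```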

ClementGautier avatar Dec 15 '21 16:12 ClementGautier

So, after spending more time on this issue, I can tell you it's not easy to reproduce. I haven't been able to reproduce it yet, and I think it's related to NetworkPolicies; the problem looks very similar to https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1863. I will try to put together a reproducible case.
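If NetworkPolicies do turn out to be the culprit, a policy of roughly this shape is the kind of thing that would need to allow the ingress controller to reach the proxy pods. The namespace and label names below are assumptions, purely for illustration.

```yaml
# Hypothetical NetworkPolicy sketch: allow traffic from the nginx ingress
# controller's namespace to the proxy pods. Labels/namespaces are assumed.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-proxy
spec:
  podSelector:
    matchLabels:
      component: proxy
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
```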

ClementGautier avatar Dec 22 '21 13:12 ClementGautier

I was able to confirm that the issue is between the ingress and the service, most likely a firewall issue: I'm getting connection resets between the ingress controller and the service. 10.56.2.25 is the proxy-public service endpoint, while .16 is the nginx ingress controller.

(screenshot showing the connection resets)

ClementGautier avatar Dec 22 '21 14:12 ClementGautier

I think I found the issue: the proxy-public service points to port 8080 on the proxy pod, but the pod doesn't listen on that port; it listens on port 8000. Editing the service to use 8000 fixed the issue. After more digging in the values, it seems I needed to use proxy.https.type: offload in combination with mlhub.env.SSL_ENABLED: true. This configures the service "properly", but then the ingress doesn't work at all because the target port is hardcoded to servicePort: 80. So I disabled the ingress templating and created the ingress manually as a temporary workaround.
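For reference, the values combination mentioned above would look roughly like this in values.yaml. This is a sketch based on the key paths quoted in this comment and is not verified against the chart's templates.

```yaml
# Sketch only: key paths taken from the comment above, not verified against the chart.
proxy:
  https:
    type: offload        # terminate TLS before the proxy (e.g. at the ingress)

mlhub:
  env:
    SSL_ENABLED: true
```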

I still don't understand how restarting the proxy pod made it work all of a sudden. I think it has to do with the environment variables being set and the behavior of the proxy itself, but if you launch the container with the option --port 8000, you should probably use that port anyway.
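As a minimal sketch of that port mismatch (the metadata and selector labels below are illustrative, not taken from the chart): the container is launched with --port 8000, so the Service's targetPort has to point at 8000 rather than 8080.

```yaml
# Illustrative sketch of the proxy-public Service after the manual edit.
# Names and selectors are placeholders; only the port relationship matters.
apiVersion: v1
kind: Service
metadata:
  name: proxy-public
spec:
  selector:
    component: proxy
  ports:
    - name: http
      port: 80          # the port the (hardcoded) ingress servicePort expects
      targetPort: 8000  # was 8080 as templated; 8000 matches the proxy's --port flag
```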

I'll make a pull request in that direction soon.

ClementGautier avatar Dec 22 '21 16:12 ClementGautier

The issue is already fixed in the jupyterhub chart, so instead of doing things twice I'd rather use it as a dependency for this chart, as discussed in https://github.com/ml-tooling/ml-hub/issues/25.

ClementGautier avatar Dec 22 '21 17:12 ClementGautier