
Dask Kubernetes Issue (jupyter-server-proxy >= 1.2.0)

Open iwalmsley opened this issue 5 years ago • 29 comments

This is a replica of #129 (sorry, I couldn't see an option to re-open it); apologies that I didn't reply at that point.

We have been getting by with jupyter-server-proxy 1.1.0 up until now, but changes in more recent versions of other libraries mean we need to upgrade, and unfortunately I am still tearing my hair out with this problem! All of the analysis below is still relevant, but I have upgraded to the latest versions.

Dask Labextension Version: 3.0.0
Dask Version: 2.23
Distributed Version: 2.23
Dask Kubernetes Version: 0.10

First off apologies if this isn't related to the Dask Labextension directly, but that's where I'm seeing the issue - if not if you could point me in the right direction that'd be great.

I have recently come across an issue with the Dask labextension: after updating our Docker images, all of the graph links (Task Stream, Progress, etc.) are greyed out after creating a cluster on Kubernetes. The cluster itself spins up fine, but I still can't access the graphs. Looking at the network tab, there's a 403 for:

https://notebook.domain/dask/dashboard/ba076ca2-49be-4262-b4a2-c4ff39eeb2d6/individual-plots.json?1589202704253

When I browse to this I get an error, presumably from jupyter-server-proxy:

Host '10.42.3.219' is not whitelisted. See https://jupyter-server-proxy.readthedocs.io/en/latest/arbitrary-ports-hosts.html for info.

Which is the IP address of the JupyterLab pod. This indicates that the pod's IP needs to be in the ServerProxy.host_whitelist, but I don't understand why this request isn't being recognised from 127.0.0.1/localhost? Additionally I haven't been able to get it working even after adding this IP into the whitelist anyway.

If anyone could point me in the right direction that'd be great. Thanks in advance.

iwalmsley avatar Sep 03 '20 15:09 iwalmsley

Hi @iwalmsley thanks for raising this.

It sounds like a config issue with your Jupyter setup. Could you share how you created the Jupyter session in the first place? As that will probably be a more appropriate place for this issue?

jacobtomlinson avatar Sep 09 '20 10:09 jacobtomlinson

Hi @jacobtomlinson thanks very much for the reply.

Yes the setup consists of a Kubernetes Cluster with an nginx-ingress-controller, JupyterLab is then exposed via an ingress and users access it remotely in this fashion. (so '10.42.3.21' is the pod/CNI IP of the jupyterlab instance rather than the network IP of a host or something like that) Is this the detail you require?

I did have an inkling that it was something to do with the infrastructure setup, but I have been testing this morning and can confirm that if I just change the version of jupyter-server-proxy to 1.1.0 it all works again, whereas switching it to 1.2.0 breaks it, so I wasn't entirely convinced.

iwalmsley avatar Sep 10 '20 13:09 iwalmsley

Thanks for the details @iwalmsley.

Do you still experience this if you switch to a more recent version of jupyter-server-proxy? The latest version is 1.5.0.

By default localhost and 127.0.0.1 are whitelisted. The lab extension will likely be picking up the IP from however the dask.distributed.Client is connected. So how is the client connecting to the scheduler?

jacobtomlinson avatar Sep 10 '20 13:09 jacobtomlinson

Yes, I still get the same issue with all versions > 1.1.0.

The client is connecting via the auto-generated snippet from the Dask labextension, e.g.:

from dask.distributed import Client

client = Client("tcp://10.42.7.237:45019")
client

This works absolutely fine - I can schedule jobs which run okay on the Dask cluster, so I don't think that's a problem. It's just with viewing the graphs etc - have attached a screenshot hopefully showing.

DaskExample

iwalmsley avatar Sep 10 '20 13:09 iwalmsley

@iwalmsley sure, so you are connecting to the scheduler via the pod IP, so the dashboard is also going to try and connect via the pod IP. If this address isn't whitelisted then you are going to see these errors.

The solution here is going to be around configuring things correctly, but it's not immediately clear how things should be configured.

Is the scheduler running in a separate pod? Or in the same pod as Jupyter?

jacobtomlinson avatar Sep 10 '20 14:09 jacobtomlinson

Thanks @jacobtomlinson. That makes sense. I did originally try adding the ServerProxy.host_whitelist flag with the IP address of the pod to the jupyter deployment & rescheduling the pod (making sure the IP was the same which took some time..) and still had the same issue, though perhaps I didn't get that quite right.

Yes the scheduler is being brought up in the Jupyter pod itself (I guess by the Dask-Labextension itself/dask-kubernetes), can see in the Jupyter logs themselves.

iwalmsley avatar Sep 10 '20 14:09 iwalmsley

Yes the scheduler is being brought up in the Jupyter pod itself

In that case why not use 127.0.0.1 instead of the pod IP?

jacobtomlinson avatar Sep 10 '20 15:09 jacobtomlinson

Not sure I really follow, in order to build the cluster I simply press the +NEW button which creates a cluster (I can't edit the scheduler address myself), when I access via the client I can switch it to 127.0.0.1 which continues to work fine, but that doesn't change the fact I can't open any of the graphs (which normally can be opened fine without even specifying a client in a kernel).

Unless you mean I can somehow specify the scheduler IP in the Dask configuration? I can't really find how to do that looking at the docs and even then the IP would need to be addressable by the workers inside the cluster to work?

iwalmsley avatar Sep 10 '20 15:09 iwalmsley

Ah I thought the extension picks up the address from the client (@ian-r-rose can you confirm?).

What happens if you update the URL at the top of the extension to point to 127.0.0.1?

jacobtomlinson avatar Sep 11 '20 15:09 jacobtomlinson

Thanks @jacobtomlinson for the response again. Is this what you mean (see screenshot)? I've tried many variations of ports & IPs, but looking at the console log it seems to be doing everything from the web client side anyway so not sure 127.0.0.1 will work?

image

iwalmsley avatar Sep 16 '20 15:09 iwalmsley

Hmm no that isn't what I meant.

You should be able to do something like /proxy/localhost/8787 or just /proxy/8787 in the URL bar.

jacobtomlinson avatar Sep 17 '20 09:09 jacobtomlinson

Ah right of course sorry!

/proxy/8787 returns not found as it seems to redirect to /status but /proxy/8787/status directly works... that's useful to know as a workaround at least.
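
A small sketch of how such a proxied dashboard path could be assembled, assuming the notebook server's base URL is known (on JupyterHub it is typically exposed as the `JUPYTERHUB_SERVICE_PREFIX` environment variable). The helper name `proxied_dashboard_url` and the example base URL are hypothetical; 8787 is the dashboard port mentioned above.

```python
import os


def proxied_dashboard_url(port=8787, base_url=None, page="status"):
    """Build a jupyter-server-proxy path to a dashboard page on localhost."""
    if base_url is None:
        # Assumption: running under JupyterHub; fall back to "/" otherwise.
        base_url = os.environ.get("JUPYTERHUB_SERVICE_PREFIX", "/")
    # Link straight to the page, since the bare /proxy/<port> redirect
    # was reported above to return "not found".
    return f"{base_url.rstrip('/')}/proxy/{port}/{page}"


print(proxied_dashboard_url(base_url="/nb/user/someone/"))
```

The resulting path is what would go into the extension's dashboard URL box or the browser's URL bar, relative to the notebook host.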

iwalmsley avatar Sep 17 '20 10:09 iwalmsley

Changing the URL to have /status at the end doesn't help me. I still get blank tabs for the various graphs. I'm running effectively the same setup: JL, JH, K8s.

GET wss://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/individual-workers/ws
[HTTP/1.1 500 Internal Server Error 395ms]

Firefox can’t establish a connection to the server at wss://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/individual-workers/ws. bokeh.min.js:575:1200
[bokeh] Failed to connect to Bokeh server: Could not open websocket bokeh.min.js:575:4429
[bokeh] Failed to load Bokeh session 7iGYQB7z8k3eXJHRRC8mbo8X1KcCxeucqZxhBacODXay: Error: Could not open websocket bokeh.min.js:574:670
Error rendering Bokeh items: Error: Could not open websocket
    _on_error https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:575
    onerror https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:575
    connect https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:575
    connect https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:575
    pull_session https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:575
    d https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:574
    add_document_from_session https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:574
    O https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:164
    embed_items https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:164
    defer https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:185
    defer https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:185
    defer https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:185
    embed_items https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:164
    embed_document https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/individual-workers:527
    fn https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/individual-workers:531
    fn https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/individual-workers:547
    safely https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:583
    fn https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/individual-workers:521
    EventListener.handleEvent* https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/individual-workers:551
    <anonymous> https://lsst-lsp-int.ncsa.illinois.edu/nb/user/athornto/proxy/8787/individual-workers:552

athornton avatar Oct 05 '20 20:10 athornton

So, here's how it doesn't work. If I set the dashboard connect address to proxy/8787/status, the buttons DO light up orange. I am creating a cluster in my notebook programmatically that is listening there rather than with the New Cluster button.

All of the dashboard buttons take me to blank tabs, which are 500s when the browser requests (as in the previous message) a URL like:

      "url": "wss://nublado.lsst.codes/nb/user/athornto/proxy/8787/individual-bandwidth-types/ws",
  

The "new cluster" button gives me a different LocalCluster on its own proxied port and changes the cluster address in the extension pane. The /dask/clusters/? endpoint shows me that one but not the one I programmatically created. I'm pretty sure the problem here is confusion about how to find the server extension endpoints that the lab extension talks to in a K8s environment, but I'm kind of stuck. Do I maybe need to add something to my top-level Jupyterhub Proxy ingress, or the ingress controller, to get URLs in the form of "wss://" going all the way back to me? But if I was missing something like that, then Bokeh would just plain not work, correct? It works fine for rendering in a notebook.

athornton avatar Oct 08 '20 18:10 athornton

@jacobtomlinson pinging you in case this isn't on your radar anymore.

Also since I have somewhat different symptoms (500 vs 404), should I open this as a different issue?

athornton avatar Oct 08 '20 18:10 athornton

Take this with a grain of salt, but it looks like most of the changes in the lab extension were moving-to-async; could it be that something that's working with an in-memory cluster is always completing and has a real result by the time we want it for a dashboard, but the higher latency of the K8s cluster means there's something that the labextension code is expecting to be a resolved result but is still a promise?

athornton avatar Oct 12 '20 19:10 athornton

@athornton I think you are missing a config item in your distributed.yaml that allows bokeh access to the right ports.

See what you get when you do:

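# Importing distributed registers its configuration defaults with dask.config.
import distributed
from dask import config

# Inspect the Bokeh application settings used by the scheduler dashboard.
config.get("distributed.scheduler.dashboard.bokeh-application")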

There should be an entry like: 'allow_websocket_origin': ['*']
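
A minimal sketch of that check, assuming the value returned by `config.get(...)` behaves like a plain dict. The helper name `origin_allowed` is made up for illustration, and the matching is deliberately simpler than whatever Bokeh itself does (only exact matches and the `*` wildcard are handled).

```python
def origin_allowed(bokeh_app_config, origin):
    """Return True if `origin` is covered by allow_websocket_origin."""
    allowed = bokeh_app_config.get("allow_websocket_origin", [])
    return "*" in allowed or origin in allowed


# With dask and distributed installed, the mapping would come from:
#   import distributed
#   from dask import config
#   cfg = config.get("distributed.scheduler.dashboard.bokeh-application")
cfg = {"allow_websocket_origin": ["*"]}  # example value for illustration
print(origin_allowed(cfg, "notebook.example.org"))
```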

jsignell avatar Oct 13 '20 14:10 jsignell

Sorry @athornton I've been out on leave. Thanks for picking this up @jsignell.

jacobtomlinson avatar Oct 19 '20 16:10 jacobtomlinson

Alas, there already is that setting:

{'allow_websocket_origin': ['*'], 'keep_alive_milliseconds': 500, 'check_unused_sessions_milliseconds': 500}

athornton avatar Oct 21 '20 20:10 athornton

@jacobtomlinson @jsignell any other ideas? I'm kind of out of them--the only obvious change is the sync->async thing, but I should also mention that I added the jupyter-labhubapp entry point back in to get similar pageconfig-data-to-the-app-injection to what I had in JL 1.x. I don't think that is going to make a difference for Dask, because as far as I know it doesn't use any of the pageconfig stuff, but relies on its own server extension.

athornton avatar Oct 22 '20 20:10 athornton

Sorry I forgot to respond yesterday. I re-read your original posts and have some feedback that might or might not help. You shouldn't expect your programmatically created cluster to show up in /dask/clusters/?, as that will only list the ones that the side panel is managing.

If you don't use the cluster manager, and just create a cluster and client programmatically, can you then click the little search icon next to the dashboard_url input? I think that is a relatively reliable way to get a link to the current client.

I am also wondering if this worked with a previous version of jupyter-server-proxy.

jsignell avatar Oct 23 '20 14:10 jsignell

I'll give that a shot. The extension used to work, in my JL 1.x setup, in that as soon as I created the cluster, the extension would go from gray to orange and I could see the various graphs (I don't think the cluster ever showed up under clusters, which is fair enough). I can take a look to see whether I was using a different jupyter-server-proxy, which doesn't seem unlikely, although it's so simple that I'd be a little surprised if much substantive changed.

I will report back once I've tried the search icon, which...I never noticed before.

athornton avatar Oct 23 '20 17:10 athornton

Doesn't seem to do anything. The connection URL is already correct, and when I start the cluster all the buttons turn from gray to orange. However, clicking any of those buttons gives me a blank tab and an "Error: Could not open websocket" in the JavaScript console. A representative console trace looks like this (the URL up through 8787 is correct):

Error rendering Bokeh items: Error: Could not open websocket
    _on_error https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:575
    onerror https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:575
    connect https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:575
    connect https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:575
    pull_session https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:575
    d https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:574
    add_document_from_session https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:574
    O https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:164
    embed_items https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:164
    defer https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:185
    defer https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:185
    defer https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:185
    embed_items https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:164
    embed_document https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/individual-profile:527
    fn https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/individual-profile:531
    fn https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/individual-profile:547
    safely https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/static/js/bokeh.min.js?v=4e38fcb7a8f2989d2b1f9d55dee62dec:583
    fn https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/individual-profile:521
    EventListener.handleEvent* https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/individual-profile:551
    <anonymous> https://lsst-lsp-stable.ncsa.illinois.edu/nb/user/athornto/proxy/8787/individual-profile:552
bokeh.min.js:164:1039

athornton avatar Oct 23 '20 17:10 athornton

Here's an animated GIF of the typical behavior. I start the notebook, launch a cluster (the buttons turn orange), request a client and wait until all 15 requested instances are online, then I read a large dataset and ask for its length. Then while it's doing that I try some of the various buttons, and eventually I remember to capture the error from JavaScript, and then I close the cluster:

dask_labext

athornton avatar Oct 23 '20 17:10 athornton

Can you access the dashboard outside of the extension? I am wondering if the dashboard is having trouble getting in touch with the scheduler.

jsignell avatar Oct 26 '20 15:10 jsignell

I can, but it gives me basically the same result: I get the page with the menu bar at the top. It, and all of the tabs on it, other than info, give me the websocket error. Info gives me the Scheduler TCP socket and logs that seem to indicate normal startup:

distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://10.108.2.134:42551
distributed.scheduler - INFO - dashboard at: :8787 

athornton avatar Oct 26 '20 17:10 athornton

Ok so this isn't an issue with dask-labextension. You might get better visibility by cross-posting this on dask-kubernetes. I think it's kind of odd that your dashboard address is a jupyterlab hub and your scheduler address is a resolved ip address. You might be missing some kind of proxy to enable that scheduler pod to communicate.

jsignell avatar Oct 26 '20 20:10 jsignell

Good to know it's not an extension problem, in any event.

The dashboard address is the hub because that's the only external endpoint the user gets. The scheduler address is internal to Kubernetes.

In case it wasn't clear, though: the clusters all work fine. My jobs run successfully. The problem is in seeing what they're doing from the various bits of the status dashboard.

athornton avatar Oct 26 '20 22:10 athornton

What's the status here? From what I read, this is still an issue related to a recent change to the jupyter-server-proxy package. It appears as though we simply need to set ServerProxy.host_whitelist to include the scheduler Pod IP in Kubernetes but I'm unsure how that's done via envvar or YAML/JSON file configuration (i.e. where in the config hierarchy does that config belong).

Does anyone have any insight here?

sonnysideup avatar Jun 17 '21 19:06 sonnysideup
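
For reference, a minimal sketch of where that setting could live, assuming jupyter-server-proxy older than 3.0 (where the trait is still named `host_whitelist`) and a Jupyter config file such as `jupyter_notebook_config.py`. The trait is documented as accepting either a list of hosts or a callable taking `(handler, host)`. The pod CIDR `10.42.0.0/16` and the function name `allow_pod_hosts` are assumptions for illustration, not values confirmed anywhere in this thread.

```python
import ipaddress

# Assumption: the cluster's pod network; adjust to your CNI's CIDR.
POD_NETWORK = ipaddress.ip_network("10.42.0.0/16")


def allow_pod_hosts(handler, host):
    """Return True if jupyter-server-proxy may proxy to `host`."""
    if host in ("localhost", "127.0.0.1"):
        return True
    try:
        return ipaddress.ip_address(host) in POD_NETWORK
    except ValueError:
        # Not an IP literal (e.g. a DNS name): reject.
        return False


# In jupyter_notebook_config.py, where Jupyter provides `c`:
# c.ServerProxy.host_whitelist = allow_pod_hosts
```

Using a callable rather than a fixed list avoids having to pin the pod IP, which changes whenever the pod is rescheduled. Whether this alone resolves the 403 reported above is not established in this thread.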