AzureVMCluster throwing raise FatalCommClosedError() from err distributed.comm.core.FatalCommClosedError

Open manuelreyesgomez opened this issue 5 years ago • 8 comments

On a new python 3.7 conda environment

$ pip install dask-cloudprovider[azure]
$ az login
$ python

from dask_cloudprovider.azure import AzureVMCluster

resource_group = "NGC-AML-Quick-Launch"
vnet="NGC-AML-Quick-Launch-vnet"
security_group="NGC-AML-Quick-Launch-nsg"
initial_node_count = 2
vm_name = "Standard_NC6s_v3"
location = "South Central US"
base_dockerfile = "rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.7"

cluster = AzureVMCluster(
    resource_group=resource_group,
    location = location,
    vnet=vnet,
    security_group=security_group,
    n_workers=initial_node_count,
    vm_size=vm_name,
    docker_image=base_dockerfile,
    docker_args="--privileged",
    worker_class="dask_cuda.CUDAWorker")
Creating scheduler instance
Assigned public IP
Network interface ready
Creating VM
Created VM dask-455260e7-scheduler
Waiting for scheduler to run at 13.84.221.226:8786
Scheduler is running
Traceback (most recent call last):
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\comm\tcp.py", line 363, in connect
    ip, port, max_buffer_size=MAX_BUFFER_SIZE, **kwargs
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\tornado\tcpclient.py", line 289, in connect
    False, ssl_options=ssl_options, server_hostname=host
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\tornado\iostream.py", line 1391, in _do_ssl_handshake
    self.socket.do_handshake()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\ssl.py", line 1139, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1091)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 10, in <module>
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\dask_cloudprovider\azure\azurevm.py", line 496, in __init__
    super().__init__(**kwargs)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\dask_cloudprovider\generic\vmcluster.py", line 284, in __init__
    super().__init__(**kwargs, security=self.security)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\deploy\spec.py", line 281, in __init__
    self.sync(self._start)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\deploy\cluster.py", line 189, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\utils.py", line 324, in f
    result[0] = yield future
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\tornado\gen.py", line 762, in run
    value = future.result()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\dask_cloudprovider\generic\vmcluster.py", line 324, in _start
    await super()._start()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\deploy\spec.py", line 314, in _start
    await super()._start()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\deploy\cluster.py", line 73, in _start
    comm = await self.scheduler_comm.live_comm()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\core.py", line 747, in live_comm
    **self.connection_args,
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\comm\core.py", line 288, in connect
    timeout=min(intermediate_cap, time_left()),
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\asyncio\tasks.py", line 442, in wait_for
    return fut.result()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\comm\tcp.py", line 376, in connect
    raise FatalCommClosedError() from err
distributed.comm.core.FatalCommClosedError
  • Dask version:
  • Python version: 3.7
  • Operating System:
  • Install method (conda, pip, source): pip

manuelreyesgomez avatar Jan 28 '21 18:01 manuelreyesgomez

Cluster connections are now secure by default.

It looks like you may need to update your local version of OpenSSL.

Alternatively, as a workaround, try setting the security=False kwarg.

jacobtomlinson avatar Jan 29 '21 09:01 jacobtomlinson
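A minimal sketch of the suggested workaround, reusing the settings from the original report and adding only security=False. The `cluster_kwargs` dict and the `start_cluster` helper are hypothetical names introduced for illustration. Calling the helper provisions real Azure VMs, so the call is left commented out.

```python
# Settings copied from the original report; only `security` is new.
cluster_kwargs = dict(
    resource_group="NGC-AML-Quick-Launch",
    location="South Central US",
    vnet="NGC-AML-Quick-Launch-vnet",
    security_group="NGC-AML-Quick-Launch-nsg",
    n_workers=2,
    vm_size="Standard_NC6s_v3",
    docker_image="rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.7",
    docker_args="--privileged",
    worker_class="dask_cuda.CUDAWorker",
    security=False,  # workaround from this thread: skip TLS to the scheduler
)


def start_cluster(kwargs=cluster_kwargs):
    """Hypothetical helper: creates the cluster (and real Azure VMs) when called."""
    # Imported lazily so the settings can be inspected without the package installed.
    from dask_cloudprovider.azure import AzureVMCluster

    return AzureVMCluster(**kwargs)


# cluster = start_cluster()  # uncomment to launch; requires `az login` first
```

With security=False the traffic between the client and the scheduler is unencrypted, so this is best treated as a diagnostic step rather than a permanent setting.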

security=False does not seem to work, as I am getting the following:

Waiting for scheduler to run at 52.171.62.23:8786
Scheduler is running
Traceback (most recent call last):
  File "<stdin>", line 11, in <module>
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\dask_cloudprovider\azure\azurevm.py", line 496, in __init__
    super().__init__(**kwargs)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\dask_cloudprovider\generic\vmcluster.py", line 284, in __init__
    super().__init__(**kwargs, security=self.security)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\deploy\spec.py", line 281, in __init__
    self.sync(self._start)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\deploy\cluster.py", line 189, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\utils.py", line 324, in f
    result[0] = yield future
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\tornado\gen.py", line 762, in run
    value = future.result()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\dask_cloudprovider\generic\vmcluster.py", line 324, in _start
    await super()._start()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\deploy\spec.py", line 314, in _start
    await super()._start()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\deploy\cluster.py", line 73, in _start
    comm = await self.scheduler_comm.live_comm()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\core.py", line 747, in live_comm
    **self.connection_args,
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\comm\core.py", line 288, in connect
    timeout=min(intermediate_cap, time_left()),
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\asyncio\tasks.py", line 442, in wait_for
    return fut.result()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\comm\tcp.py", line 357, in connect
    self._check_encryption(address, connection_args)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_7__01_28_2021_no_change\lib\site-packages\distributed\comm\tcp.py", line 347, in _check_encryption
    "refusing communication from/to %r" % (self.prefix + address,)
RuntimeError: encryption required by Dask configuration, refusing communication from/to 'tcp://52.171.62.23:8786'

manuelreyesgomez avatar Jan 29 '21 18:01 manuelreyesgomez
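The RuntimeError above mentions the Dask configuration, so one thing that may be worth checking is whether the local configuration asks for encrypted connections. The sketch below reads the `distributed.comm.require-encryption` key with `dask.config.get`. The helper name `encryption_required` is hypothetical, and a value of None does not rule out other causes, since the thread does not confirm where the requirement comes from.

```python
def encryption_required():
    """Return the local value of distributed.comm.require-encryption.

    Returns None when the key is unset or when dask is not importable.
    """
    try:
        import dask
    except ImportError:
        return None
    # default=None avoids a KeyError when the distributed defaults are not loaded.
    return dask.config.get("distributed.comm.require-encryption", default=None)


print("require-encryption:", encryption_required())
```

If this prints True, a plain tcp:// connection is refused on the client side regardless of the security=False kwarg.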

Have you tried updating OpenSSL?

quasiben avatar Jan 29 '21 18:01 quasiben

@quasiben

I am using a pretty recent one:

$ openssl version
OpenSSL 1.1.1i 8 Dec 2020

manuelreyesgomez avatar Feb 02 '21 00:02 manuelreyesgomez
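The `openssl` command-line tool and the OpenSSL library that a given Python interpreter is linked against can differ, especially inside conda environments. A small check, using only the standard library, reports what the interpreter itself uses. The helper name `python_openssl_info` is hypothetical.

```python
import ssl
import sys


def python_openssl_info():
    """Report the OpenSSL build linked into the running Python interpreter."""
    return {
        "python": sys.version.split()[0],
        "openssl": ssl.OPENSSL_VERSION,
        "openssl_version_info": ssl.OPENSSL_VERSION_INFO,
    }


print(python_openssl_info())
```

Running this in the same environment that raises the error shows whether the interpreter's OpenSSL matches the version reported by the command-line tool.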

I tried removing the argument

docker_args="--privileged",

and got the same behavior.

manuelreyesgomez avatar Feb 02 '21 00:02 manuelreyesgomez

I ran it from the RAPIDS container and got the same result.

manuelreyesgomez avatar Feb 02 '21 01:02 manuelreyesgomez

@quasiben Several people have now tried this and are getting the same error, with a recent OpenSSL version (12/2020).

manuelreyesgomez avatar Feb 09 '21 00:02 manuelreyesgomez

security=False works on a Linux machine. I will check again whether I still get the errors I reported above on a Windows machine.

manuelreyesgomez avatar Feb 09 '21 01:02 manuelreyesgomez