
RuntimeWarning: get_log_events rate limit exceeded

Open · rpanai opened this issue on Aug 14 '20 · 8 comments

I'm using this as a reference, and whenever I run a distributed computation like the following, I get the warning shown below.

MCVE

import dask
import pandas as pd
from dask.distributed import Client, progress
from dask import compute, delayed
from dask_cloudprovider import FargateCluster

%%time
cpu = 1
ram = 2

cluster = FargateCluster(n_workers=1,
                         image='rpanai/fargate-worker:2020-08-06',
                         vpc="my_vpc",
                         subnets=["subnet-1", "subnet-1"],
                         worker_cpu=int(cpu * 1024),
                         worker_mem=int(ram * 1024),
                         cloudwatch_logs_group="my_log_group",
                         task_role_policies=['arn:aws:iam::aws:policy/AmazonS3FullAccess'],
                         scheduler_timeout='10 minutes'
                        )

cluster.adapt(minimum=1, maximum=40)
client = Client(cluster)
client
def fun(fn1):
    # merge the matching Parquet files from fldr1 and fldr2,
    # then write the result to fldr_out
    fn2 = fn1.replace("fldr1", "fldr2")
    fn_out = fn1.replace("fldr1", "fldr_out")
    df1 = pd.read_parquet(fn1)
    df2 = pd.read_parquet(fn2)
    df1 = pd.merge(df1, df2)
    # stuff
    df1.to_parquet(fn_out)

to_process = [delayed(fun)(el) for el in lst]
out = compute(to_process)
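
For reference, lst is the list of input Parquet paths under fldr1. The real files live in a private S3 bucket, so a hypothetical stand-in (bucket name and file pattern are made up) might look like:

# hypothetical stand-in for the private input paths used above
lst = [f"s3://my-bucket/fldr1/part-{i:04d}.parquet" for i in range(100)]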

Warning

RuntimeWarning: get_log_events rate limit exceeded, retrying after delay.

Environment:

  • Dask version: 2.14
  • Python version: 3.6.10
  • Operating System: rhel fedora
  • Install method (conda, pip, source): conda

Dockerfile:

FROM continuumio/miniconda3:4.7.12

RUN conda install --yes \
    -c conda-forge \
    python=3.6.10 \
    python-blosc \
    cytoolz \
    dask==2.14.0 \
    dask-ml=1.6.0 \
    dask-xgboost=0.1.11 \
    msgpack-python=1.0.0 \
    nomkl \
    numpy==1.19.1 \
    pandas==1.1.0 \
    numba=0.50.1 \
    pyarrow=1.0.0 \
    tini==0.18.0 \
    pip \
    s3fs \
    && conda clean -tipsy \
    && find /opt/conda/ -type f,l -name '*.a' -delete \
    && find /opt/conda/ -type f,l -name '*.pyc' -delete \
    && find /opt/conda/ -type f,l -name '*.js.map' -delete \
    && find /opt/conda/lib/python*/site-packages/bokeh/server/static -type f,l -name '*.js' -not -name '*.min.js' -delete \
    && rm -rf /opt/conda/pkgs

COPY prepare.sh /usr/bin/prepare.sh

RUN mkdir /opt/app

ENTRYPOINT ["tini", "-g", "--", "/usr/bin/prepare.sh"]

rpanai · Aug 14 '20

Thanks @rpanai. Could you please share a complete reproducible example including cluster setup?

jacobtomlinson · Aug 14 '20

Hi @jacobtomlinson, I'll update my issue with the details, but it's not going to be fully reproducible since it relies on access to a private S3 bucket.

rpanai · Aug 14 '20

> Thanks @rpanai. Could you please share a complete reproducible example including cluster setup?

@jacobtomlinson updated!

rpanai · Aug 14 '20

Thanks @rpanai. Could you also share what the problem is? Does the computation not complete?

jacobtomlinson · Aug 17 '20

@jacobtomlinson I'll try to create an MCVE with some data available on S3. In general it completes the job, but if I add some more workers it can stop the computation.

rpanai · Aug 17 '20

@rpanai I guess my point is that the warning you shared is unrelated; it is just a warning and can be ignored. It is mostly there to give an indication of why things are starting up slowly.

But you didn't give any other indication of what was actually broken here.

> In general it completes the job, but if I add some more workers it can stop the computation.

Could you expand a little more on this?
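
(Side note for anyone who only wants to hide the message: since it is a plain Python RuntimeWarning, it can be filtered with the standard library warnings module. This is a generic Python sketch, not a dask-cloudprovider option.)

import warnings

# suppress only this specific warning; everything else still surfaces
warnings.filterwarnings(
    "ignore",
    message="get_log_events rate limit exceeded",
    category=RuntimeWarning,
)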

jacobtomlinson · Aug 17 '20

Hi @jacobtomlinson, I've just had a computation seemingly stop dead in its tracks, and the last log message is:

/usr/local/lib/python3.6/dist-packages/dask_cloudprovider/providers/aws/ecs.py:334: RuntimeWarning: get_log_events rate limit exceeded, retrying after delay.

Is there any way this could be blocking in some way? All the workers are still up and healthy, as are the scheduler and the task container. :thinking:
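
For context, that warning means dask-cloudprovider is backing off because CloudWatch Logs throttled the GetLogEvents call. Conceptually the pattern is something like the boto3 sketch below (a simplified illustration, not the actual ecs.py code; the log group and stream names are placeholders):

import time

import boto3
from botocore.exceptions import ClientError

logs = boto3.client("logs")

def get_log_events_with_backoff(group, stream, delay=5):
    # keep retrying GetLogEvents while CloudWatch throttles the call;
    # this is roughly what "retrying after delay" refers to
    while True:
        try:
            return logs.get_log_events(logGroupName=group, logStreamName=stream)
        except ClientError as e:
            if e.response["Error"]["Code"] == "ThrottlingException":
                time.sleep(delay)
            else:
                raise

events = get_log_events_with_backoff("my_log_group", "placeholder-log-stream")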

valpesendorfer · Sep 04 '20

Not that I'm aware of. I'm not entirely sure why the logs call is being made. Are you retrieving the logs in some way?

jacobtomlinson · Sep 14 '20