
RuntimeWarning: get_log_events rate limit exceeded

Open · rpanai opened this issue on Aug 14 '20 · 8 comments

I'm using this as a reference, and whenever I run a distributed computation like the following, I get the warning shown below.

MCVE

import dask
import pandas as pd
from dask.distributed import Client, progress
from dask import compute, delayed
from dask_cloudprovider import FargateCluster

%%time
cpu = 1
ram = 2

cluster = FargateCluster(n_workers=1,
                         image='rpanai/fargate-worker:2020-08-06',
                         vpc="my_vpc",
                         subnets=["subnet-1", "subnet-1"],
                         worker_cpu=int(cpu * 1024),
                         worker_mem=int(ram * 1024),
                         cloudwatch_logs_group="my_log_group",
                         task_role_policies=['arn:aws:iam::aws:policy/AmazonS3FullAccess'],
                         scheduler_timeout='10 minutes'
                        )

cluster.adapt(minimum=1, maximum=40)
client = Client(cluster)
client
def fun(fn1):
    # merge the matching Parquet files from fldr1 and fldr2,
    # then write the result to fldr_out
    fn2 = fn1.replace("fldr1", "fldr2")
    fn_out = fn1.replace("fldr1", "fldr_out")
    df1 = pd.read_parquet(fn1)
    df2 = pd.read_parquet(fn2)
    df1 = pd.merge(df1, df2)
    # stuff
    df1.to_parquet(fn_out)

to_process = [delayed(fun)(el) for el in lst]
out = compute(to_process)
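
For reference, lst is the list of input Parquet paths under fldr1. The real files live in a private S3 bucket, so a hypothetical stand-in (bucket name and file pattern are made up) might look like:

# hypothetical stand-in for the private input paths used above
lst = [f"s3://my-bucket/fldr1/part-{i:04d}.parquet" for i in range(100)]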

Warning

RuntimeWarning: get_log_events rate limit exceeded, retrying after delay.

Environment:

  • Dask version: 2.14
  • Python version: 3.6.10
  • Operating System: rhel fedora
  • Install method (conda, pip, source): conda

Dockerfile:

FROM continuumio/miniconda3:4.7.12

RUN conda install --yes \
    -c conda-forge \
    python=3.6.10 \
    python-blosc \
    cytoolz \
    dask==2.14.0 \
    dask-ml=1.6.0 \
    dask-xgboost=0.1.11 \
    msgpack-python=1.0.0 \
    nomkl \
    numpy==1.19.1 \
    pandas==1.1.0 \
    numba=0.50.1 \
    pyarrow=1.0.0 \
    tini==0.18.0 \
    pip \
    s3fs \
    && conda clean -tipsy \
    && find /opt/conda/ -type f,l -name '*.a' -delete \
    && find /opt/conda/ -type f,l -name '*.pyc' -delete \
    && find /opt/conda/ -type f,l -name '*.js.map' -delete \
    && find /opt/conda/lib/python*/site-packages/bokeh/server/static -type f,l -name '*.js' -not -name '*.min.js' -delete \
    && rm -rf /opt/conda/pkgs

COPY prepare.sh /usr/bin/prepare.sh

RUN mkdir /opt/app

ENTRYPOINT ["tini", "-g", "--", "/usr/bin/prepare.sh"]

rpanai · Aug 14 '20

Thanks @rpanai. Could you please share a complete reproducible example including cluster setup?

jacobtomlinson · Aug 14 '20

Hi @jacobtomlinson, I'll update my issue with the details, but it's not going to be fully reproducible since it relies on access to a private S3 bucket.

rpanai · Aug 14 '20

> Thanks @rpanai. Could you please share a complete reproducible example including cluster setup?

@jacobtomlinson updated!

rpanai · Aug 14 '20

Thanks @rpanai. Could you also share what the problem is? Does the computation not complete?

jacobtomlinson · Aug 17 '20

@jacobtomlinson I'll try to create an MCVE with some data available on S3. In general it completes the job, but if I add some more workers it can stop the computation.

rpanai · Aug 17 '20

@rpanai I guess my point is that the warning you shared is unrelated; it is just a warning and can be ignored. It is mostly there to give an indication of why things are starting up slowly.

But you didn't give any other indication of what was actually broken here.

> In general it completes the job, but if I add some more workers it can stop the computation.

Could you expand a little more on this?
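
(Side note for anyone who only wants to hide the message: since it is a plain Python RuntimeWarning, it can be filtered with the standard library warnings module. This is a generic Python sketch, not a dask-cloudprovider option.)

import warnings

# suppress only this specific warning; everything else still surfaces
warnings.filterwarnings(
    "ignore",
    message="get_log_events rate limit exceeded",
    category=RuntimeWarning,
)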

jacobtomlinson · Aug 17 '20

Hi @jacobtomlinson, I've just had a computation seemingly stop dead in its tracks, and the last log message is:

/usr/local/lib/python3.6/dist-packages/dask_cloudprovider/providers/aws/ecs.py:334: RuntimeWarning: get_log_events rate limit exceeded, retrying after delay.

Is there any way this could be blocking in some way? All the workers are still up and healthy, as are the scheduler and the task container. :thinking:
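
For context, that warning means dask-cloudprovider is backing off because CloudWatch Logs throttled the GetLogEvents call. Conceptually the pattern is something like the boto3 sketch below (a simplified illustration, not the actual ecs.py code; the log group and stream names are placeholders):

import time

import boto3
from botocore.exceptions import ClientError

logs = boto3.client("logs")

def get_log_events_with_backoff(group, stream, delay=5):
    # keep retrying GetLogEvents while CloudWatch throttles the call;
    # this is roughly what "retrying after delay" refers to
    while True:
        try:
            return logs.get_log_events(logGroupName=group, logStreamName=stream)
        except ClientError as e:
            if e.response["Error"]["Code"] == "ThrottlingException":
                time.sleep(delay)
            else:
                raise

events = get_log_events_with_backoff("my_log_group", "placeholder-log-stream")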

valpesendorfer · Sep 04 '20

Not that I'm aware of. I'm not entirely sure why the logs call is being made. Are you retrieving the logs in some way?

jacobtomlinson · Sep 14 '20