RuntimeWarning: get_log_events rate limit exceeded
I'm using this as a reference, and whenever I run a distributed computation like the one below I get the warning shown further down.
MCVE
import pandas as pd

import dask
from dask.distributed import Client, progress
from dask import compute, delayed
from dask_cloudprovider import FargateCluster
%%time
cpu = 1
ram = 2
cluster = FargateCluster(
    n_workers=1,
    image='rpanai/fargate-worker:2020-08-06',
    vpc="my_vpc",
    subnets=["subnet-1", "subnet-1"],
    worker_cpu=int(cpu * 1024),
    worker_mem=int(ram * 1024),
    cloudwatch_logs_group="my_log_group",
    task_role_policies=['arn:aws:iam::aws:policy/AmazonS3FullAccess'],
    scheduler_timeout='10 minutes',
)
cluster.adapt(minimum=1, maximum=40)
client = Client(cluster)
client
def fun(fn1):
    fn2 = fn1.replace("fldr1", "fldr2")
    fn_out = fn1.replace("fldr1", "fldr_out")
    df1 = pd.read_parquet(fn1)
    df2 = pd.read_parquet(fn2)
    df1 = pd.merge(df1, df2)
    # stuff
    df1.to_parquet(fn_out)
to_process = [delayed(fun)(el) for el in lst]
out = compute(to_process)
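Here lst is just the list of input parquet paths under fldr1; a made-up example (bucket and file names are placeholders):
# Hypothetical example of lst; bucket and file names are placeholders.
lst = [
    "s3://my-private-bucket/fldr1/part-0000.parquet",
    "s3://my-private-bucket/fldr1/part-0001.parquet",
]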
Warning
RuntimeWarning: get_log_events rate limit exceeded, retrying after delay.
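The warning comes from CloudWatch Logs throttling GetLogEvents; the retry pattern is roughly the following (an illustrative boto3 sketch with placeholder group/stream names, not the actual dask-cloudprovider code):
# Illustrative sketch of retrying CloudWatch GetLogEvents on throttling.
# Group/stream names are placeholders; this is just the general pattern.
import time
import warnings

import boto3
from botocore.exceptions import ClientError

logs_client = boto3.client("logs")

def read_log_events(group="my_log_group", stream="my_stream", delay=5, retries=10):
    for _ in range(retries):
        try:
            resp = logs_client.get_log_events(logGroupName=group, logStreamName=stream)
            return resp["events"]
        except ClientError as e:
            if e.response["Error"]["Code"] == "ThrottlingException":
                warnings.warn(
                    "get_log_events rate limit exceeded, retrying after delay.",
                    RuntimeWarning,
                )
                time.sleep(delay)
            else:
                raise
    return []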
Environment:
- Dask version: 2.14
- Python version: 3.6.10
- Operating System: rhel fedora
- Install method (conda, pip, source): conda
Dockerfile:
FROM continuumio/miniconda3:4.7.12
RUN conda install --yes \
    -c conda-forge \
    python=3.6.10 \
    python-blosc \
    cytoolz \
    dask==2.14.0 \
    dask-ml=1.6.0 \
    dask-xgboost=0.1.11 \
    msgpack-python=1.0.0 \
    nomkl \
    numpy==1.19.1 \
    pandas==1.1.0 \
    numba=0.50.1 \
    pyarrow=1.0.0 \
    tini==0.18.0 \
    pip \
    s3fs \
    && conda clean -tipsy \
    && find /opt/conda/ -type f,l -name '*.a' -delete \
    && find /opt/conda/ -type f,l -name '*.pyc' -delete \
    && find /opt/conda/ -type f,l -name '*.js.map' -delete \
    && find /opt/conda/lib/python*/site-packages/bokeh/server/static -type f,l -name '*.js' -not -name '*.min.js' -delete \
    && rm -rf /opt/conda/pkgs
COPY prepare.sh /usr/bin/prepare.sh
RUN mkdir /opt/app
ENTRYPOINT ["tini", "-g", "--", "/usr/bin/prepare.sh"]
Thanks @rpanai. Could you please share a complete reproducible example including cluster setup?
Hi @jacobtomlinson, I'll update my issue with details, but it's not going to be fully reproducible as it relies on access to a private S3 bucket.
Thanks @rpanai. Could you please share a complete reproducible example including cluster setup?
@jacobtomlinson updated!
Thanks @rpanai. Could you also share what the problem is? Does the computation not complete?
@jacobtomlinson I'll try to create an MCVE with some data available on S3. In general it completes the job, but if I add some more workers it can stop the computation.
@rpanai I guess my point is that the warning you shared is unrelated; it is just a warning and can be ignored. It is mostly there to give an indication of why things are starting up slowly.
But you haven't given any other indication of what is actually broken here.
In general it completes the job, but if I add some more workers it can stop the computation.
Could you expand a little more on this?
Hi @jacobtomlinson, I've just had a computation seemingly stop dead in its tracks, and the last log message is:
/usr/local/lib/python3.6/dist-packages/dask_cloudprovider/providers/aws/ecs.py:334: RuntimeWarning: get_log_events rate limit exceeded, retrying after delay.
Is there any way this call could be blocking? All the workers are still up and healthy, as are the scheduler and the task container. :thinking:
Not that I'm aware of. I'm not entirely sure why the logs call is being made. Are you retrieving the logs in some way?
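For example, something like the following triggers one GetLogEvents call per scheduler/worker log stream on ECS/Fargate; a sketch, assuming cluster.get_logs() is available in your distributed version (older releases expose it as cluster.logs()):
# Sketch: explicitly pulling logs from the cluster object.
# Assumes Cluster.get_logs() exists in your distributed version
# (older releases expose the same thing as cluster.logs()).
# On ECS/Fargate each entry is read from CloudWatch, i.e. one
# GetLogEvents call per scheduler/worker log stream.
worker_logs = cluster.get_logs()
for name, log in worker_logs.items():
    print(f"=== {name} ===")
    print(log)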