
Failed to flush task queue within 120 seconds

Open Jaswanth-Reddy-S opened this issue 3 years ago • 3 comments

I am training my ML model with ScriptRunConfig in the Azure ML service. After the model finishes training, the run throws this error:

Failed to flush task queue within 120 seconds
Message: Failed to flush task queue within 120 seconds
InnerException None
ErrorResponse
{
    "error": {
        "code": "UserError",
        "message": "Failed to flush task queue within 120 seconds",
        "inner_error": {
            "code": "ResourceExhausted",
            "inner_error": {
                "code": "Timeout"
            }
        }
    }
}

Jaswanth-Reddy-S avatar Feb 21 '22 18:02 Jaswanth-Reddy-S

It looks like you are trying to download large checkpoint files, which causes the timeout while flushing. The default timeout for flushing the queue is 120 seconds. As a workaround, could you try using run.download_file instead of run.download_files? That should allow downloading large checkpoint files.

Sample is available here: https://docs.microsoft.com/python/api/azureml-core/azureml.core.run.run?view=azure-ml-py#azureml-core-run-run-download-files
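
A minimal sketch of that workaround, assuming `run` is an azureml.core `Run` object (for example from `Run.get_context()`) and the checkpoints live under an `outputs/` prefix. The helper name `download_files_individually`, the prefix, and the output directory are hypothetical and chosen only for illustration. It lists the run's files with `get_file_names()` and fetches them one at a time with `download_file`, rather than in a single `download_files` call:

```python
import os


def download_files_individually(run, prefix, output_dir):
    """Download each run file whose name starts with `prefix`, one call per file.

    `run` is expected to behave like azureml.core.Run: it must provide
    get_file_names() and download_file(name, output_file_path).
    """
    downloaded = []
    for name in run.get_file_names():
        if not name.startswith(prefix):
            continue
        # Keep the run's relative path under output_dir so names cannot collide.
        target = os.path.join(output_dir, name)
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        run.download_file(name=name, output_file_path=target)
        downloaded.append(target)
    return downloaded
```

With a real run this might be called as `download_files_individually(run, "outputs/", "./checkpoints")`. Whether this avoids the timeout in a given pipeline depends on file sizes and network throughput.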

harneetvirk avatar Feb 23 '22 17:02 harneetvirk

I have this same issue now as well. It only started occurring after a recent pipeline rebuild; the pipeline did not have this problem before. Is there a way to extend the timeout?

jvschoen avatar Jul 28 '22 17:07 jvschoen

Some forest models are tens of GBs in size, so if there are bottlenecks in the upload, this timeout does not seem to allow enough time for the model upload.

jvschoen avatar Jul 28 '22 17:07 jvschoen

The timeout happened because of the upload and download size: the transfer was taking more time than the limit allowed.

Jaswanth-Reddy-S avatar Feb 10 '23 10:02 Jaswanth-Reddy-S