MachineLearningNotebooks Failed to flush task queue within 120 seconds

I am training my ML model with ScriptRunConfig on the Azure ML service. After training finishes, the run throws this error:

    Failed to flush task queue within 120 seconds
    Message: Failed to flush task queue within 120 seconds
    InnerException None
    ErrorResponse
    { "error": { "code": "UserError", "message": "Failed to flush task queue within 120 seconds", "inner_error": { "code": "ResourceExhausted", "inner_error": { "code": "Timeout" } } } }

Feb 21 '22 18:02 Jaswanth-Reddy-S

It looks like you are downloading large checkpoint files, which causes the flush to time out. The default timeout for flushing the queue is 120 seconds. As a workaround, could you try run.download_file instead of run.download_files? That should allow large checkpoint files to download.

Sample is available here: https://docs.microsoft.com/python/api/azureml-core/azureml.core.run.run?view=azure-ml-py#azureml-core-run-run-download-files
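A minimal sketch of that workaround, assuming the checkpoints sit under an "outputs/" prefix in the run's files. The helper name download_files_individually, the prefix, and the output directory are hypothetical; only run.get_file_names and run.download_file are calls from the azureml-core Run class. Whether this avoids the timeout in a given setup is not guaranteed.

```python
import os


def download_files_individually(run, prefix="outputs/", output_dir="downloaded"):
    """Fetch each run file under `prefix` with its own download_file call,
    instead of one bulk run.download_files call.

    `run` is expected to be an azureml.core.Run (or any object exposing
    get_file_names() and download_file(name, output_file_path)).
    """
    downloaded = []
    for name in run.get_file_names():
        if not name.startswith(prefix):
            continue  # skip logs and other files outside the prefix
        target = os.path.join(output_dir, name)
        # Recreate the remote folder layout locally before downloading.
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        run.download_file(name=name, output_file_path=target)
        downloaded.append(target)
    return downloaded


# Hypothetical usage inside a script that has access to a run object:
#   from azureml.core import Run
#   run = Run.get_context()
#   paths = download_files_individually(run, prefix="outputs/")
```

Downloading one file per call keeps each transfer separate, so a single very large checkpoint can also be retried on its own if it fails.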

Feb 23 '22 17:02 harneetvirk

I have the same issue now as well. It only started occurring after a recent pipeline rebuild; the pipeline was not having this problem before. Is there a way to extend the timeout?

Jul 28 '22 17:07 jvschoen

Some forest models are tens of GB in size, so if there is any bottleneck in the upload, this timeout does not seem to leave enough time for the model to finish uploading.

Jul 28 '22 17:07 jvschoen

The timeout happened because the files being uploaded and downloaded were large, so the transfer took longer than the limit.

Feb 10 '23 10:02 Jaswanth-Reddy-S