Failed to flush task queue within 120 seconds
I am training my ML model with the ScriptRunConfig method in the Azure ML service. After training finishes, the run throws this error:
Failed to flush task queue within 120 seconds
Message: Failed to flush task queue within 120 seconds
InnerException None
ErrorResponse { "error": { "code": "UserError", "message": "Failed to flush task queue within 120 seconds", "inner_error": { "code": "ResourceExhausted", "inner_error": { "code": "Timeout" } } } }
It looks like you are trying to download large checkpoint files, and the transfer is timing out while the task queue is being flushed. The default time allowed for flushing the queue is 120 seconds. As a workaround, could you try calling run.download_file for each file instead of run.download_files? That should allow large checkpoint files to download.
Sample is available here: https://docs.microsoft.com/python/api/azureml-core/azureml.core.run.run?view=azure-ml-py#azureml-core-run-run-download-files
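To illustrate the workaround, here is a minimal sketch that downloads a run's files one at a time with run.download_file rather than in a single run.download_files call. It assumes `run` is an azureml.core.Run object; the helper name download_outputs_one_by_one, the "outputs/" prefix, and the "downloaded" target folder are hypothetical choices for illustration, not part of the SDK. Whether this avoids the timeout in your case depends on file size and bandwidth.

```python
import os


def download_outputs_one_by_one(run, prefix="outputs/", output_directory="downloaded"):
    """Download every run file whose name starts with `prefix`, one call per file.

    `run` is assumed to be an azureml.core.Run (or any object exposing
    get_file_names() and download_file(name, output_file_path)).
    The function name and defaults are illustrative, not SDK names.
    """
    downloaded = []
    for name in run.get_file_names():
        if not name.startswith(prefix):
            continue  # skip logs and other files outside the prefix
        target = os.path.join(output_directory, name)
        parent = os.path.dirname(target)
        if parent:
            # create the local folder structure before downloading
            os.makedirs(parent, exist_ok=True)
        # one download call per file, instead of a single batched download_files
        run.download_file(name=name, output_file_path=target)
        downloaded.append(target)
    return downloaded
```

With a real workspace you would obtain `run` first, for example from Run.get_context() inside the training script or from an existing experiment, and then call download_outputs_one_by_one(run).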
I have this same issue now as well. It only started occurring after a recent pipeline rebuild; the pipeline did not have this problem before. Is there a way to extend the timeout?
Some forest models are tens of GBs in size, so if there is any bottleneck in the upload, this timeout does not seem to allow enough time for the model to finish uploading.
The timeout happened because the upload and download size was large, so the transfer was taking more time than the limit allows.