MachineLearningNotebooks icon indicating copy to clipboard operation
MachineLearningNotebooks copied to clipboard

Run hanging on request for unknown reason

Open byteandbark opened this issue 3 years ago • 0 comments

Running a hyperdrive run with a pytorch model I have some of my child runs hanging with a request in the log as below:

2022-04-20 12:27:52,149|azureml.BatchTaskQueueAdd_1_Batches.WaitFlushSource:BatchTaskQueueAdd_1_Batches|DEBUG|[STOP] 2022-04-20 12:27:52,273|azureml._SubmittedRun#HD_e32fc294-eadf-405e-96ac-970bdd35bc49_3.RunHistoryFacade.MetricsClient._post_run_metrics_log_failed_validations-async:False|DEBUG|[STOP] 2022-04-20 12:28:11,735|azureml.core.authentication|DEBUG|Time to expire 1813610.265009 seconds 2022-04-20 12:28:41,739|azureml.core.authentication|DEBUG|Time to expire 1813580.260927 seconds 2022-04-20 12:29:11,739|azureml.core.authentication|DEBUG|Time to expire 1813550.260178 seconds

Expected behaviour: Run to complete and give metrics

Actual behaviour: Log shows run is hanging and not posting metrics

Any help with why this is happening is appreciated (or if this is posted in the wrong place I am happy to be directed to the correct place to post this issue)

byteandbark avatar Apr 20 '22 12:04 byteandbark