amazon-sagemaker-examples
[Bug Report] Sagemaker debugger issue - CreateXgboostReport processing job fails when training XGBoost image version >= 1.3-1
Link to the notebook Add the link to the notebook.
Describe the bug
Running the "Amazon SageMaker Debugger XGBoost training report for Higgs Boson Detection Challenge" notebook (latest version to date) from the sagemaker-examples repo with a newer XGBoost container version, 1.3-1 or 1.5-1, causes the CreateXgboostReport processing job to fail.
To reproduce
Replace the line
xgboost_container = image_uris.retrieve("xgboost", region, "1.2-1")
with one that requests a newer container version, 1.3-1 or 1.5-1:
xgboost_container = image_uris.retrieve("xgboost", region, "1.3-1") # or 1.5-1
and run the whole notebook.
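To make the affected range explicit while testing, the version tags can be compared programmatically. This is a minimal sketch based only on the four versions tried in this report (1.2-1 and 1.2-2 worked, 1.3-1 and 1.5-1 failed). The helper names `parse_container_version` and `report_expected_to_fail` are hypothetical and not part of the SageMaker SDK, and the threshold is an assumption drawn from these observations.

```python
import re

# Assumption from this report: 1.2-1 and 1.2-2 produced the report,
# while 1.3-1 and 1.5-1 did not, so 1.3-1 is treated as the first failing tag.
FIRST_FAILING_VERSION = (1, 3, 1)


def parse_container_version(version: str) -> tuple:
    """Turn a tag such as "1.3-1" into a comparable tuple such as (1, 3, 1)."""
    match = re.fullmatch(r"(\d+)\.(\d+)-(\d+)", version)
    if match is None:
        raise ValueError(f"unexpected container version: {version!r}")
    return tuple(int(part) for part in match.groups())


def report_expected_to_fail(version: str) -> bool:
    """Return True if the tag falls in the range observed to fail here."""
    return parse_container_version(version) >= FIRST_FAILING_VERSION


if __name__ == "__main__":
    for tag in ("1.2-1", "1.2-2", "1.3-1", "1.5-1"):
        print(tag, report_expected_to_fail(tag))
```

A check like this could guard the notebook cell that calls image_uris.retrieve, so a run with an affected tag is flagged before the training job starts.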
Logs
[2022-11-07 13:29:11.235 <...> INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2022-11-07 13:29:13.678 <...> INFO local_trial.py:35] Loading trial base_trial at path /opt/ml/processing/input/tensors
Exception during rule evaluation: Customer Error: No debugging data was saved by the training job. Check that the debugger hook was configured correctly before starting the training job. Exception: Training job has ended. All the collection files could not be loaded
Traceback (most recent call last):
  File "evaluate.py", line 119, in _create_trials
    range_steps=(self.start_step, self.end_step))
  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/utils.py", line 25, in create_trial
    return LocalTrial(name=name, dirname=path, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/local_trial.py", line 36, in __init__
    self._load_collections()
  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/trial.py", line 168, in _load_collections
    _wait_for_collection_files(1) # wait for the first collection file
  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/trial.py", line 165, in _wait_for_collection_files
    raise MissingCollectionFiles
smdebug.exceptions.MissingCollectionFiles: Training job has ended. All the collection files could not be loaded
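The MissingCollectionFiles error suggests that the rule container found no collection files in the debugger output. One way to confirm this independently is to download the training job's debugger output to a local directory and look for those files. The sketch below assumes that smdebug writes files whose names end in `_collections.json` somewhere under the output directory. That layout is an assumption and should be checked against the smdebug version in use. The function name `find_collection_files` is hypothetical.

```python
from pathlib import Path


def find_collection_files(debug_output_dir):
    """List files that look like smdebug collection files under a local directory.

    Assumption: collection files are named "*_collections.json" and may sit
    in nested subdirectories of the debugger output.
    """
    root = Path(debug_output_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*_collections.json"))


if __name__ == "__main__":
    import sys

    # Pass the local copy of the debugger output as the first argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    files = find_collection_files(target)
    if files:
        for path in files:
            print(path)
    else:
        print("No collection files found; the debugger hook likely saved nothing.")
```

An empty result for a job trained with 1.3-1 or 1.5-1, compared with a non-empty result for 1.2-2, would point at the training container rather than at the rules image.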
I see several possible sources of the issue, but I am not sure which one is responsible:
- XGBoost container images:
  - from 1.2-1 up to the latest version to date, 1.5-1, the code that uses the smdebug library has not changed.
- smdebug library:
  - although the version of the library listed in requirements.txt changed from smdebug==1.0.7 in 1.2-1 to smdebug==1.0.10 in 1.2-2 and later, running the example notebook with XGBoost container version 1.2-2 worked fine.
- 972752614525.dkr.ecr.ap-southeast-1.amazonaws.com/sagemaker-debugger-rules:latest image:
  - this image version was used in all cases. I doubt that the image was updated while I was running my "experiments".
- sagemaker==2.112.2.