amazon-sagemaker-examples

[Bug Report] Sagemaker debugger issue - CreateXgboostReport processing job fails when training XGBoost image version >= 1.3-1

Open morelen17 opened this issue 3 years ago • 0 comments

Link to the notebook Add the link to the notebook.

Describe the bug Running the "Amazon SageMaker Debugger XGBoost training report for Higgs Boson Detection Challenge" notebook (latest version to date) from the sagemaker-examples repo with a newer XGBoost container version (1.3-1 or 1.5-1) causes the CreateXgboostReport processing job to fail.

To reproduce Replace the line

xgboost_container = image_uris.retrieve("xgboost", region, "1.2-1")

with one that retrieves a newer container version (1.3-1 or 1.5-1):

xgboost_container = image_uris.retrieve("xgboost", region, "1.3-1")  # or 1.5-1

and run the whole notebook.

Logs

[2022-11-07 13:29:11.235 <...> INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2022-11-07 13:29:13.678 <...> INFO local_trial.py:35] Loading trial base_trial at path /opt/ml/processing/input/tensors
Exception during rule evaluation: Customer Error: No debugging data was saved by the training job. Check that the debugger hook was configured correctly before starting the training job. Exception: Training job has ended. All the collection files could not be loaded
Traceback (most recent call last):
  File "evaluate.py", line 119, in _create_trials
    range_steps=(self.start_step, self.end_step))
  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/utils.py", line 25, in create_trial
    return LocalTrial(name=name, dirname=path, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/local_trial.py", line 36, in __init__
    self._load_collections()
  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/trial.py", line 168, in _load_collections
    _wait_for_collection_files(1)  # wait for the first collection file
  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/trial.py", line 165, in _wait_for_collection_files
    raise MissingCollectionFiles
smdebug.exceptions.MissingCollectionFiles: Training job has ended. All the collection files could not be loaded
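
The MissingCollectionFiles error above means the rule job found no collection files under the tensors path. A rough sketch for checking a locally downloaded copy of the debugger output follows. It rests on an assumption not confirmed by the logs: that smdebug writes JSON files whose names end in `_collections.json` somewhere under a `collections/` subdirectory. The helper name `find_collection_files` and the path `./debug-output` are hypothetical.

```python
from pathlib import Path


def find_collection_files(tensors_dir):
    """Return collection files found under a local copy of the debugger output.

    Assumption: smdebug stores collections as JSON files whose names end in
    "_collections.json" somewhere below <tensors_dir>/collections/.
    """
    collections_dir = Path(tensors_dir) / "collections"
    if not collections_dir.is_dir():
        return []
    # Search recursively so the check does not depend on the exact nesting.
    return sorted(collections_dir.rglob("*_collections.json"))


if __name__ == "__main__":
    # Hypothetical local path: a downloaded copy of the debugger output from S3.
    files = find_collection_files("./debug-output")
    if files:
        print(f"Found {len(files)} collection file(s), first: {files[0]}")
    else:
        print("No collection files found; the hook may not have saved anything.")
```

If this returns an empty list for a 1.3-1 or 1.5-1 run but not for a 1.2-1 run, the problem would be on the training side (nothing was written) rather than in the rules image.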

I see several possible sources of the issue, but I am not sure which one is responsible:

  • XGBoost container images:
    • from 1.2-1 up to 1.5-1 (the latest version to date), the parts of the code that use the smdebug library have not changed.
  • smdebug library:
    • although the version pinned in requirements.txt changed from smdebug==1.0.7 in 1.2-1 to smdebug==1.0.10 in 1.2-2 and later, running the example notebook with XGBoost container version 1.2-2 worked fine.
  • 972752614525.dkr.ecr.ap-southeast-1.amazonaws.com/sagemaker-debugger-rules:latest image:
    • this image was used in all cases. I doubt that it was updated while I was running my "experiments".
  • sagemaker==2.112.2.
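
Since the candidates above all come down to version differences, it may help to attach the exact versions installed in the notebook environment to a report like this one. A minimal sketch using only the standard library follows. The helper name `installed_versions` and the default package list are illustrative, and this reports the notebook kernel's packages only, not what is installed inside the training container or the rules image.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("sagemaker", "smdebug", "xgboost")):
    """Map each package name to its installed version, or None if absent."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in this environment.
            result[name] = None
    return result


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}=={ver}" if ver else f"{name}: not installed")
```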

morelen17, Nov 07 '22 15:11