Ability to ignore specific files/folders in ModelTrainer's script mode
Describe the feature you'd like
I do not want .git, .env, .vscode, data, __pycache__, or other irrelevant files/folders to be uploaded to the S3 artifacts when I use ModelTrainer's script mode via SourceCode. Moreover, copying source_dir may be time-consuming due to the large number of files, e.g. in .git and/or .env.
How would this feature be used? Please describe.
During development/sanity-checking on my local machine I have some unnecessary files/folders. The idea is to skip uploading them when using script mode. Say I have this project structure:
```
tree
.
├── README.md
├── .git
├── pipeline1
│   ├── train.py
│   ├── data
│   ├── .env
│   └── __pycache__
│       └── __init__.cpython-310.pyc
└── pipeline2
    ├── train.py
    ├── data
    └── __pycache__
        └── __init__.cpython-310.pyc
```
```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode, Compute

image = "<image>"

source_code = SourceCode(
    source_dir="pipeline1",
    command="python train.py",
)
# or
# source_code = SourceCode(
#     source_dir=".",
#     command="python -m pipeline1.train",
# )

compute = Compute(
    instance_count=1,
    instance_type="ml.g5.8xlarge",
)

model_trainer = ModelTrainer(
    training_image=image,
    source_code=source_code,
    compute=compute,
)
model_trainer.train()
```
Expected result: code uploaded to s3://<default_bucket_path>/<base_job_name>/input/code/ without the ignored files/folders.
Describe alternatives you've considered
- Create a TempDir, copy the scripts without the unnecessary files, and pass it as source_dir.
- Re-design the project structure so unwanted files/dirs live outside the directory you want to upload.
- Always use BYOC instead of script mode (not practical for some cases, e.g. sanity checks).
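The first alternative above can be sketched with the standard library alone. This is a minimal illustration, not SDK code; the helper name `stage_clean_copy` and the pattern list are hypothetical:

```python
import os
import shutil
from tempfile import TemporaryDirectory

# Hypothetical pattern list; adjust to your project.
IGNORE_PATTERNS = [".git", ".env", ".vscode", "data", "__pycache__"]

def stage_clean_copy(source_dir: str, tmp_root: str, patterns=IGNORE_PATTERNS) -> str:
    """Copy source_dir into tmp_root, skipping entries matching patterns."""
    dest = os.path.join(tmp_root, os.path.basename(os.path.abspath(source_dir)))
    shutil.copytree(
        source_dir,
        dest,
        dirs_exist_ok=True,
        ignore=shutil.ignore_patterns(*patterns),
    )
    return dest

# Usage sketch: stage a clean copy, then point SourceCode at it, e.g.
# with TemporaryDirectory() as tmp:
#     clean_dir = stage_clean_copy("pipeline1", tmp)
#     source_code = SourceCode(source_dir=clean_dir, command="python train.py")
```

The drawback, as noted, is that the caller has to manage the temporary directory's lifetime themselves, which is exactly the boilerplate the feature request wants to eliminate.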
Additional context
This seems similar to a request we have received in the past: https://github.com/aws/sagemaker-python-sdk/issues/2875
Thanks for the response, @benieric.
The proposal in https://github.com/aws/sagemaker-python-sdk/issues/2875 looks fine to me. It seems that using a .sagemakerignore file - similar to a .gitignore or .dockerignore - could be an effective solution.
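To make the idea concrete, here is a rough sketch of how `.sagemakerignore` patterns could be loaded and matched. The file name, helper functions, and matching semantics are all assumptions (the SDK does not implement any of this today); it uses `fnmatch`-style globs, loosely modeled on `.dockerignore`:

```python
import fnmatch
import os

def load_ignore_patterns(source_dir: str, ignore_file: str = ".sagemakerignore") -> list:
    """Read glob patterns from an ignore file, skipping blanks and comments."""
    path = os.path.join(source_dir, ignore_file)
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [
            line.strip()
            for line in f
            if line.strip() and not line.lstrip().startswith("#")
        ]

def is_ignored(rel_path: str, patterns: list) -> bool:
    """True if the path, or any of its components, matches a pattern."""
    parts = rel_path.split(os.sep)
    return any(
        fnmatch.fnmatch(part, pat) or fnmatch.fnmatch(rel_path, pat)
        for part in parts
        for pat in patterns
    )
```

Matching on individual path components means a bare `.git` entry prunes the whole directory anywhere in the tree, which mirrors how most people expect `.gitignore`-style files to behave.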
In my case, I rely exclusively on BYOC mode. I also tested script mode, but rejected it because it took too long to execute: every training run copies a large number of files, resulting in over 30 minutes to upload the necessary folder to S3. Just FYI, here is what I have locally:
```shell
> find .git -type f | wc -l
3384
> find .env -type f | wc -l
81047
```
Hi @benieric, what do you think about applying this kind of solution? It should work. Any potential drawbacks?
```python
ignore_patterns = ['.env', '.git', 'data', '__pycache__']
source_code = SourceCode(
    source_dir="source",
    entry_script="train.py",
    ignore_patterns=ignore_patterns,
)
```

```python
import os
import shutil
from tempfile import TemporaryDirectory
from typing import List, Optional

...
if self.source_code.source_dir:
    source_code_channel = self.create_input_data_channel(
        channel_name=SM_CODE,
        data_source=self.source_code.source_dir,
        key_prefix=input_data_key_prefix,
        ignore_patterns=self.source_code.ignore_patterns,
    )
...

def create_input_data_channel(
    self,
    channel_name: str,
    data_source: DataSourceType,
    key_prefix: Optional[str] = None,
    ignore_patterns: Optional[List[str]] = None,
) -> Channel:
    ...
    tmp_dir = TemporaryDirectory()
    shutil.copytree(
        data_source,
        os.path.join(tmp_dir.name, data_source),
        dirs_exist_ok=True,
        # Guard against ignore_patterns=None (the default).
        ignore=shutil.ignore_patterns(*(ignore_patterns or [])),
    )
    s3_uri = self.sagemaker_session.upload_data(
        path=tmp_dir.name,
        bucket=self.sagemaker_session.default_bucket(),
        key_prefix=key_prefix,
    )
```
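As a quick sanity check of the mechanism the proposal relies on: `shutil.ignore_patterns` returns a callable suitable for `copytree`'s `ignore=` argument. Given a directory path and the names inside it, it returns the subset of names to skip, and `copytree` prunes matched directories without descending into them:

```python
import shutil

# Build the ignore callable from glob patterns.
ignore = shutil.ignore_patterns(".env", ".git", "data", "__pycache__")

# copytree calls this for each directory it visits.
skipped = ignore("pipeline1", ["train.py", ".env", "__pycache__", "data"])
print(sorted(skipped))  # ['.env', '__pycache__', 'data']
```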
@benieric @nargokul What's your opinion on the proposed solution? Any better options?