Ability to ignore specific files/folders in ModelTrainer's script mode
Describe the feature you'd like
I do not want .git, .env, .vscode, data, __pycache__, or other irrelevant files/folders to be uploaded to the S3 artifacts when I use ModelTrainer's script mode via SourceCode. Moreover, copying source_dir may be time-consuming due to the large number of files, e.g. in .git and/or .env.
How would this feature be used? Please describe.
During development/sanity-checking on my local machine I have some unnecessary files/folders. The idea is to skip uploading them when using script mode. Say I have this project structure:
```
tree
.
├── README.md
├── .git
├── pipeline1
│   ├── train.py
│   ├── data
│   ├── .env
│   └── __pycache__
│       └── __init__.cpython-310.pyc
└── pipeline2
    ├── train.py
    ├── data
    └── __pycache__
        └── __init__.cpython-310.pyc
```
```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode, Compute

image = "<image>"

source_code = SourceCode(
    source_dir="pipeline1",
    command="python train.py",
)
# or
# source_code = SourceCode(
#     source_dir=".",
#     command="python -m pipeline1.train",
# )

compute = Compute(
    instance_count=1,
    instance_type="ml.g5.8xlarge",
)

model_trainer = ModelTrainer(
    training_image=image,
    source_code=source_code,
    compute=compute,
)
model_trainer.train()
```
Expected result: code uploaded to s3://<default_bucket_path>/<base_job_name>/input/code/ without the ignored files/folders.
Describe alternatives you've considered
- Create a TempDir, copy the scripts without the unnecessary files, and pass it as source_dir.
- Re-design the project structure so unwanted files/dirs live outside the directory you want to upload.
- Always use BYOC instead of script mode (not practical for some cases, e.g. sanity checks).
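The first alternative above can be sketched with the standard library alone. This is a minimal illustration, not SDK code; the helper name `stage_clean_copy` and the pattern list are hypothetical:

```python
import os
import shutil
from tempfile import TemporaryDirectory

# Hypothetical pattern list; adjust to your project.
IGNORE_PATTERNS = [".git", ".env", ".vscode", "data", "__pycache__"]

def stage_clean_copy(source_dir: str, tmp_root: str, patterns=IGNORE_PATTERNS) -> str:
    """Copy source_dir into tmp_root, skipping entries matching patterns."""
    dest = os.path.join(tmp_root, os.path.basename(os.path.abspath(source_dir)))
    shutil.copytree(
        source_dir,
        dest,
        dirs_exist_ok=True,
        ignore=shutil.ignore_patterns(*patterns),
    )
    return dest

# Usage sketch: stage a clean copy, then point SourceCode at it, e.g.
# with TemporaryDirectory() as tmp:
#     clean_dir = stage_clean_copy("pipeline1", tmp)
#     source_code = SourceCode(source_dir=clean_dir, command="python train.py")
```

The drawback, as noted, is that the caller has to manage the temporary directory's lifetime themselves, which is exactly the boilerplate the feature request wants to eliminate.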
Additional context
This seems similar to a request we have received in the past: https://github.com/aws/sagemaker-python-sdk/issues/2875
Thanks for the response, @benieric.
The proposal in https://github.com/aws/sagemaker-python-sdk/issues/2875 looks fine to me. It seems that using a .sagemakerignore file - similar to a .gitignore or .dockerignore - could be an effective solution.
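To make the idea concrete, here is a rough sketch of how `.sagemakerignore` patterns could be loaded and matched. The file name, helper functions, and matching semantics are all assumptions (the SDK does not implement any of this today); it uses `fnmatch`-style globs, loosely modeled on `.dockerignore`:

```python
import fnmatch
import os

def load_ignore_patterns(source_dir: str, ignore_file: str = ".sagemakerignore") -> list:
    """Read glob patterns from an ignore file, skipping blanks and comments."""
    path = os.path.join(source_dir, ignore_file)
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [
            line.strip()
            for line in f
            if line.strip() and not line.lstrip().startswith("#")
        ]

def is_ignored(rel_path: str, patterns: list) -> bool:
    """True if the path, or any of its components, matches a pattern."""
    parts = rel_path.split(os.sep)
    return any(
        fnmatch.fnmatch(part, pat) or fnmatch.fnmatch(rel_path, pat)
        for part in parts
        for pat in patterns
    )
```

Matching on individual path components means a bare `.git` entry prunes the whole directory anywhere in the tree, which mirrors how most people expect `.gitignore`-style files to behave.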
In my case, I rely exclusively on BYOC mode. I also tested script mode, but rejected it because it took too long to execute: every training run copies a large number of files, resulting in over 30 minutes to upload the necessary folder to S3. Just FYI, here is what I have locally:
```shell
> find .git -type f | wc -l
3384
> find .env -type f | wc -l
81047
```
Hi @benieric, what do you think about applying this kind of solution? It should work. Any potential drawbacks?
```python
ignore_patterns = ['.env', '.git', 'data', '__pycache__']
source_code = SourceCode(
    source_dir="source",
    entry_script="train.py",
    ignore_patterns=ignore_patterns,
)
```

```python
import os
import shutil
from tempfile import TemporaryDirectory
from typing import List, Optional

...
if self.source_code.source_dir:
    source_code_channel = self.create_input_data_channel(
        channel_name=SM_CODE,
        data_source=self.source_code.source_dir,
        key_prefix=input_data_key_prefix,
        ignore_patterns=self.source_code.ignore_patterns,
    )
...

def create_input_data_channel(
    self,
    channel_name: str,
    data_source: DataSourceType,
    key_prefix: Optional[str] = None,
    ignore_patterns: Optional[List[str]] = None,
) -> Channel:
    ...
    tmp_dir = TemporaryDirectory()
    shutil.copytree(
        data_source,
        os.path.join(tmp_dir.name, data_source),
        dirs_exist_ok=True,
        # Guard against ignore_patterns=None (the default).
        ignore=shutil.ignore_patterns(*(ignore_patterns or [])),
    )
    s3_uri = self.sagemaker_session.upload_data(
        path=tmp_dir.name,
        bucket=self.sagemaker_session.default_bucket(),
        key_prefix=key_prefix,
    )
```
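As a quick sanity check of the mechanism the proposal relies on: `shutil.ignore_patterns` returns a callable suitable for `copytree`'s `ignore=` argument. Given a directory path and the names inside it, it returns the subset of names to skip, and `copytree` prunes matched directories without descending into them:

```python
import shutil

# Build the ignore callable from glob patterns.
ignore = shutil.ignore_patterns(".env", ".git", "data", "__pycache__")

# copytree calls this for each directory it visits.
skipped = ignore("pipeline1", ["train.py", ".env", "__pycache__", "data"])
print(sorted(skipped))  # ['.env', '__pycache__', 'data']
```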
@benieric @nargokul What's your opinion on the proposed solution? Any better options?