
[bug] GPU OOM when using the recommended batch_size


Hi AWS Team,

I'm following your documentation for training language models (bert-base-uncased), and I'm unable to use batch_size=12 (or 28 when the Training Compiler is enabled) on an ml.g4dn.2xlarge instance due to a GPU OOM error.

The batch size values for this specific model are taken from https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html

SageMaker SDK version: 2.86.2

batch_size = 12

hyperparameters = {
    'epochs': 1,
    'model_name': 'bert-base-uncased',
    'n_gpus': 1,
    'train_batch_size': batch_size,
}

# Scale the learning rate by batch size; the original LR assumed a batch size of 32
hyperparameters['learning_rate'] = 5e-5 / 32 * hyperparameters['train_batch_size']

# Scale the volume size by number of epochs
volume_size = 60 + 2 * hyperparameters['epochs']
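
With train_batch_size=12, this scaling works out to 5e-5 / 32 * 12 = 1.875e-5, which matches the learning_rate reported in the training log below.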


huggingface_estimator = HuggingFace(
    entry_point          = 'simcse_unsupervised.py',
    source_dir           = './code',
    instance_type        = 'ml.g4dn.2xlarge',  # 'ml.p3.2xlarge' is another Hugging Face-supported GPU instance
    instance_count       = 1,
    role                 = role,
    transformers_version = '4.17.0',   # '4.11.0' has no image in the eu-central-1 region
    pytorch_version      = '1.10.2',   # '1.9.0' does not work in the eu-central-1 region
    py_version           = 'py38',
    hyperparameters      = hyperparameters,
    compiler_config      = TrainingCompilerConfig(),
    environment          = {'GPU_NUM_DEVICES': '1'},
    volume_size          = volume_size,
    disable_profiler     = True,
    debugger_hook_config = False)
The relevant part of the entry point, simcse_unsupervised.py:

train_data = [InputExample(texts=[s, s]) for s in train_sentences]
train_dataloader = DataLoader(train_data, batch_size=args.train_batch_size, shuffle=True)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=args.epochs,
    weight_decay=0,
    show_progress_bar=True,
    scheduler='constantlr',
    optimizer_params={'lr': args.learning_rate},
)
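
One mitigation I could try on my side (my own idea, not from the AWS docs): SentenceTransformer pads every batch up to model.max_seq_length, and activation memory grows with sequence length, so capping it might be enough to fit batch_size=12 in the T4's ~15 GiB. A minimal sketch, assuming the script builds the model with the standard constructor (the actual construction in simcse_unsupervised.py may differ, and 128 is an illustrative cap):

from sentence_transformers import SentenceTransformer

# Hypothetical setup mirroring the training script
model = SentenceTransformer('bert-base-uncased')

# BERT accepts up to 512 tokens; long sentences inflate activation memory,
# so capping the tokenized sequence length reduces per-batch memory.
model.max_seq_length = 128  # illustrative value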
2022-05-19 09:51:42,385 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {
        "sagemaker_training_compiler_debug_mode": false,
        "sagemaker_training_compiler_enabled": true
    },
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "epochs": 1,
        "learning_rate": 1.8750000000000002e-05,
        "model_name": "bert-base-uncased",
        "n_gpus": 1,
        "train_batch_size": 12
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "train": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "huggingface-pytorch-trcomp-training-2022-05-19-09-43-36-377",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-eu-central-1-886370516942/huggingface-pytorch-trcomp-training-2022-05-19-09-43-36-377/source/sourcedir.tar.gz",
    "module_name": "simcse_unsupervised",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.g4dn.xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.g4dn.xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "simcse_unsupervised.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"epochs":1,"learning_rate":1.8750000000000002e-05,"model_name":"bert-base-uncased","n_gpus":1,"train_batch_size":12}
SM_USER_ENTRY_POINT=simcse_unsupervised.py
SM_FRAMEWORK_PARAMS={"sagemaker_training_compiler_debug_mode":false,"sagemaker_training_compiler_enabled":true}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["train"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=simcse_unsupervised
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-eu-central-1-886370516942/huggingface-pytorch-trcomp-training-2022-05-19-09-43-36-377/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{"sagemaker_training_compiler_debug_mode":false,"sagemaker_training_compiler_enabled":true},"channel_input_dirs":{"train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"epochs":1,"learning_rate":1.8750000000000002e-05,"model_name":"bert-base-uncased","n_gpus":1,"train_batch_size":12},"input_config_dir":"/opt/ml/input/config","input_data_config":{"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"huggingface-pytorch-trcomp-training-2022-05-19-09-43-36-377","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-eu-central-1-886370516942/huggingface-pytorch-trcomp-training-2022-05-19-09-43-36-377/source/sourcedir.tar.gz","module_name":"simcse_unsupervised","network_interface_name":"eth0","num_cpus":4,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"simcse_unsupervised.py"}
SM_USER_ARGS=["--epochs","1","--learning_rate","1.8750000000000002e-05","--model_name","bert-base-uncased","--n_gpus","1","--train_batch_size","12"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_EPOCHS=1
SM_HP_LEARNING_RATE=1.8750000000000002e-05
SM_HP_MODEL_NAME=bert-base-uncased
SM_HP_N_GPUS=1
SM_HP_TRAIN_BATCH_SIZE=12
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages:/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220304-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument-3.4.2-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument_cext-0.2.4-py3.8-linux-x86_64.egg:/opt/conda/lib/python3.8/site-packages/urllib3-1.26.8-py3.8.egg
Invoking script with the following command:
/opt/conda/bin/python3.8 simcse_unsupervised.py --epochs 1 --learning_rate 1.8750000000000002e-05 --model_name bert-base-uncased --n_gpus 1 --train_batch_size 12
[2022-05-19 09:51:43.502 torch.__training_compiler__.TrainingCompilerConfig INFO] Found configuration for Training Compiler. Compiler will be configured during import of torch_xla.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Downloading: 100%|██████████| 570/570 [00:00<00:00, 491kB/s]
Downloading: 100%|██████████| 420M/420M [00:04<00:00, 105MB/s]
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Downloading: 100%|██████████| 28.0/28.0 [00:00<00:00, 20.1kB/s]
Downloading: 100%|██████████| 226k/226k [00:00<00:00, 648kB/s]
Downloading: 100%|██████████| 455k/455k [00:00<00:00, 1.04MB/s]
2022-05-19 09:52:04,459 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device: cuda
******* list files *********:  ['mongo_training_events.parquet']
********************** Reading Data *************************
********************** Reading Processed(Cleaned) Data *************************
Sample data:
                                            MSG_CLEAN
0  newmsg alerts triggered on oracle_availability...
Training batch size: 12
12
using torch.nograd
/opt/conda/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[2022-05-19 09:52:10.752: W smdistributed/modelparallel/torch/nn/predefined_hooks.py:47] Found unsupported HuggingFace version 4.17.0 for automated tensor parallelism. HuggingFace modules will not be automatically distributed. You can use smp.tp_register_with_module API to register desired modules for tensor parallelism, or directly instantiate an smp.nn.DistributedModule. Supported HuggingFace transformers versions for automated tensor parallelism: ['4.16.2']
Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/14926 [00:00<?, ?it/s]
Iteration:   0%|          | 0/14926 [00:03<?, ?it/s]
Epoch:   0%|          | 0/1 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "simcse_unsupervised.py", line 183, in <module>
    model.fit(
  File "/opt/conda/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 712, in fit
    loss_value = loss_model(features, labels)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/sentence_transformers/losses/MultipleNegativesRankingLoss.py", line 53, in forward
    reps = [self.model(sentence_feature)['sentence_embedding'] for sentence_feature in sentence_features]
  File "/opt/conda/lib/python3.8/site-packages/sentence_transformers/losses/MultipleNegativesRankingLoss.py", line 53, in <listcomp>
    reps = [self.model(sentence_feature)['sentence_embedding'] for sentence_feature in sentence_features]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/sentence_transformers/models/Transformer.py", line 66, in forward
    output_states = self.auto_model(**trans_features, return_dict=False)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 996, in forward
    encoder_outputs = self.encoder(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 585, in forward
    layer_outputs = layer_module(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 513, in forward
    layer_output = apply_chunking_to_forward(
  File "/opt/conda/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2472, in apply_chunking_to_forward
    return forward_fn(*input_tensors)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 525, in feed_forward_chunk
    intermediate_output = self.intermediate(attention_output)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 427, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/activations.py", line 56, in forward
    return self.act(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1556, in gelu
    return torch._C._nn.gelu(input)
RuntimeError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 14.76 GiB total capacity; 13.83 GiB already allocated; 19.75 MiB free; 13.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2022-05-19 09:52:15,271 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2022-05-19 09:52:15,272 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
ExitCode 1
ErrorMessage "RuntimeError:
 CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 14.76 GiB total capacity; 13.83 GiB already allocated; 19.75 MiB free; 13.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"
Command "/opt/conda/bin/python3.8 simcse_unsupervised.py --epochs 1 --learning_rate 1.8750000000000002e-05 --model_name bert-base-uncased --n_gpus 1 --train_batch_size 12"
2022-05-19 09:52:15,272 sagemaker-training-toolkit ERROR    Encountered exit_code 1

2022-05-19 09:52:29 Uploading - Uploading generated training model
2022-05-19 09:52:29 Failed - Training job failed
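
For completeness, the error message itself suggests tuning the CUDA caching allocator. A minimal, untested sketch of passing that hint through the estimator's environment (the value 128 is illustrative, not a documented recommendation; this helps with fragmentation, not with a working set that is simply too large):

# Same HuggingFace estimator as above, with the allocator hint added.
environment = {
    'GPU_NUM_DEVICES': '1',
    # Cap the size of split blocks to reduce fragmentation, per the OOM message.
    'PYTORCH_CUDA_ALLOC_CONF': 'max_split_size_mb:128',
}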

— vldbnc, May 19 '22

Hi, thanks for bringing this to our attention.

Can you provide more details?

  1. What sequence length are you using?
  2. Are you using a public dataset? If yes, which one?
  3. Are you using AMP (Automatic Mixed Precision)? (See the sketch after this list.)
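
For reference, sentence-transformers exposes AMP via model.fit(..., use_amp=True), which under native CUDA training wraps the standard torch.cuda.amp pattern sketched below; how it interacts with the Training Compiler's torch_xla backend is a separate question. The optimizer, train_dataloader, and loss_model names stand in for the script's existing objects:

import torch

scaler = torch.cuda.amp.GradScaler()
for features, labels in train_dataloader:   # the script's existing DataLoader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():         # forward pass in mixed precision
        loss = loss_model(features, labels)
    scaler.scale(loss).backward()           # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)                  # unscales gradients, then steps
    scaler.update()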

— Lokiiiiii, Aug 11 '22

Closing this issue since there has been no activity.

— codeislife99, Sep 19 '22