Low GPU utilization for multi-GPU training
I have trained a Conformer model on my own custom dataset in Thai. However, GPU utilization seems to be pretty low, and training is slow (~2 s/batch). The GPUs sit at around 5-10% utilization. Is there any way to debug this problem?
For training, I simply edited examples/conformer/config.yaml and ran:
$ python examples/conformer/train_conformer.py --device 0 1 2 3
Software specification:
- OS: Debian GNU/Linux 10 (buster), kernel 4.19.0-9-cloud-amd64 x86_64
- GPUs: 4x Nvidia Tesla V100 (16 GB RAM)
- TensorFlowASR installed by building from source
config.yaml
speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  vocabulary: vocabularies/thai.characters
  target_vocab_size: 1024
  max_subword_length: 4
  blank_at_zero: True
  beam_width: 5
  norm_score: True

model_config:
  name: conformer
  subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  positional_encoding: sinusoid_concat
  dmodel: 144
  num_blocks: 16
  head_size: 36
  num_heads: 4
  mha_type: relmha
  kernel_size: 32
  fc_factor: 0.5
  dropout: 0.1
  embed_dim: 320
  embed_dropout: 0.1
  num_rnns: 1
  rnn_units: 320
  rnn_type: lstm
  layer_norm: True
  joint_dim: 320

learning_config:
  augmentations:
    after:
      time_masking:
        num_masks: 10
        mask_factor: 100
        p_upperbound: 0.05
      freq_masking:
        num_masks: 1
        mask_factor: 27
  dataset_config:
    train_paths:
      - /home/chompk/trainv1_trainscript.tsv
    eval_paths:
      - /home/chompk/valv1_trainscript.tsv
    test_paths:
      - /mnt/d/SpeechProcessing/Datasets/LibriSpeech/test-clean/transcripts.tsv
    tfrecords_dir: null
  optimizer_config:
    warmup_steps: 40000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9
  running_config:
    batch_size: 4
    accumulation_steps: 4
    num_epochs: 20
    outdir: /mnt/d/SpeechProcessing/Trained/local/conformer
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000
GPU Utilization
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:04.0 Off | 0 |
| N/A 40C P0 58W / 300W | 15752MiB / 16130MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:05.0 Off | 0 |
| N/A 38C P0 66W / 300W | 15704MiB / 16130MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:06.0 Off | 0 |
| N/A 40C P0 66W / 300W | 15752MiB / 16130MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:07.0 Off | 0 |
| N/A 39C P0 58W / 300W | 15704MiB / 16130MiB | 7% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 9598 C python 15741MiB |
| 1 9598 C python 15693MiB |
| 2 9598 C python 15741MiB |
| 3 9598 C python 15693MiB |
+-----------------------------------------------------------------------------+
Training Steps Example
> Start evaluation ...
[Eval] [Step 1000] |████████████████████| 4423/4423 [22:08<00:00, 3.33batch/s, transducer_loss=171.865]
> End evaluation ...
[Train] [Epoch 1/20] | | 1500/796100 [1:38:28<421:04:34, 1.91s/batch, transducer_loss=159.42458]
> Start evaluation ...
[Eval] [Step 1500] |████████████████████| 4423/4423 [23:15<00:00, 3.17batch/s, transducer_loss=153.2395]
> End evaluation ...
[Train] [Epoch 1/20] | | 2000/796100 [2:18:06<456:58:56, 2.07s/batch, transducer_loss=140.7582]
> Start evaluation ...
[Eval] [Step 2000] |████████████████████| 4423/4423 [22:36<00:00, 3.26batch/s, transducer_loss=137.00543]
> End evaluation ...
[Train] [Epoch 1/20] | | 2500/796100 [2:57:05<409:56:45, 1.86s/batch, transducer_loss=126.64603]
> Start evaluation ...
[Eval] [Step 2500] |████████████████████| 4423/4423 [22:52<00:00, 3.22batch/s, transducer_loss=126.15583]
> End evaluation ...
[Train] [Epoch 1/20] | | 2648/796100 [3:23:48<506:25:46, 2.30s/batch, transducer_loss=125.96002]
This is weird. Try using --mxp; mixed precision is faster. Anyway, I've tested on an RTX 2080 Ti and GPU usage is around 30-70%.
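For example, the same call as above with the flag added:
$ python examples/conformer/train_conformer.py --device 0 1 2 3 --mxp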
You could also try caching the dataset with the cache flag if you have enough RAM. After the first epoch I got at least 2 batch/s, if I remember correctly, on a T4 with high GPU utilization.
Sorry for the stupid question, but how do I use the cache flag?
@tann9949 Sure, just pass --cache in your training call. You can use TFRecords too with --tfrecords; specify a directory for them to be stored in via tfrecords_dir: inside your config.yml. The records will be created for you the first time you run the training script if they don't exist yet.
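For example (with /path/to/tfrecords as a placeholder for any writable directory):
$ python examples/conformer/train_conformer.py --device 0 1 2 3 --cache --tfrecords
and in config.yml:
tfrecords_dir: /path/to/tfrecords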
@bill-kalog Thanks for the reply!
I've tried this method and it speeds training up by ~1.2 s/batch. Still, GPU utilization is merely 5%.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:04.0 Off | 0 |
| N/A 40C P0 58W / 300W | 15752MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:05.0 Off | 0 |
| N/A 39C P0 58W / 300W | 15704MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:06.0 Off | 0 |
| N/A 40C P0 57W / 300W | 15752MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:07.0 Off | 0 |
| N/A 39C P0 57W / 300W | 15704MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 18286 C python 15741MiB |
| 1 18286 C python 15693MiB |
| 2 18286 C python 15741MiB |
| 3 18286 C python 15693MiB |
+-----------------------------------------------------------------------------+
I'm not sure whether it's related to the sequence length of my audio files; my maximum sequence length is around 30 seconds. I have no idea how to debug this problem.
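(For scale: with stride_ms: 10, a 30-second clip is about 30 000 ms / 10 ms = 3000 feature frames before subsampling, so batches that mix short and long utterances would spend a lot of compute on padding.)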
@tann9949 I noticed that from step 2500 the speed drops from 3.1 batch/s to 2 s/batch. Was the usage still around 5% within those first 2500 steps?
The 2 s/batch was during training; 3.1 batch/s was during eval. GPU usage is around 0-5% during training and 20-40% during eval.

@tann9949 My bad, so the problem might lie in optimizer.apply_gradients or tape.gradient. Can you try running with train_ga_conformer.py? It uses gradient accumulation for training.
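(For context, gradient accumulation just sums tape.gradient results over several sub-batches before a single optimizer.apply_gradients call. A minimal sketch of the pattern, with a toy model and stand-in loss, not the repo's actual train_ga_conformer.py code:)

import tensorflow as tf

# Toy setup just to make the sketch executable; the real script uses the
# Conformer model and the RNN-T loss instead.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.build((None, 4))
optimizer = tf.keras.optimizers.Adam()
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([32, 4]), tf.random.normal([32, 1]))).batch(4)
accumulation_steps = 4

accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))  # stand-in loss
    grads = tape.gradient(loss, model.trainable_variables)
    accumulated = [a + g for a, g in zip(accumulated, grads)]
    if (step + 1) % accumulation_steps == 0:
        # One optimizer update per accumulation_steps sub-batches.
        optimizer.apply_gradients(zip(accumulated, model.trainable_variables))
        accumulated = [tf.zeros_like(v) for v in model.trainable_variables]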
For some reason, this gets even worse

@tann9949 Yeah, as expected, because it runs with a larger batch size. Can you train on LibriSpeech, so I can see whether the problem is the GPU, the code, or the data?
@usimarit I've run training on LibriSpeech with this configuration:
config.yaml
speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  vocabulary: null
  target_vocab_size: 1024
  max_subword_length: 4
  blank_at_zero: True
  beam_width: 5
  norm_score: True

model_config:
  name: conformer
  subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  positional_encoding: sinusoid_concat
  dmodel: 144
  num_blocks: 16
  head_size: 36
  num_heads: 4
  mha_type: relmha
  kernel_size: 32
  fc_factor: 0.5
  dropout: 0.1
  embed_dim: 320
  embed_dropout: 0.1
  num_rnns: 1
  rnn_units: 320
  rnn_type: lstm
  layer_norm: True
  joint_dim: 320

learning_config:
  augmentations:
    after:
      time_masking:
        num_masks: 10
        mask_factor: 100
        p_upperbound: 0.05
      freq_masking:
        num_masks: 1
        mask_factor: 27
  dataset_config:
    train_paths:
      - /home/chompk/librispeech/LibriSpeech/train-clean-100/transcript.tsv
    eval_paths:
      - /home/chompk/librispeech/LibriSpeech/dev-clean/transcript.tsv
      - /home/chompk/librispeech/LibriSpeech/dev-other/transcript.tsv
    test_paths:
      - /home/chompk/librispeech/LibriSpeech/test-clean/transcript.tsv
    tfrecords_dir: /home/chompk/tfrecords_data
  optimizer_config:
    warmup_steps: 40000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9
  running_config:
    batch_size: 4
    accumulation_steps: 4
    num_epochs: 20
    outdir: /home/chompk/conformer_libri
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000
Still, GPU utilization is around 0%
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:04.0 Off | 0 |
| N/A 39C P0 63W / 300W | 15704MiB / 16130MiB | 9% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:05.0 Off | 0 |
| N/A 40C P0 69W / 300W | 15752MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:06.0 Off | 0 |
| N/A 39C P0 65W / 300W | 15752MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:07.0 Off | 0 |
| N/A 40C P0 60W / 300W | 15704MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 9007 C python 15693MiB |
| 1 9007 C python 15741MiB |
| 2 9007 C python 15741MiB |
| 3 9007 C python 15693MiB |
+-----------------------------------------------------------------------------+
Training ran at around 2.06 s/batch (without gradient accumulation):
[Train] [Epoch 1/20] |▏ | 258/35660 [12:16<22:02:21, 2.24s/batch, transducer_loss=1021.1291]
I've also tried LibriSpeech training with train_ga_conformer.py. It still performs worse, but with better GPU utilization:
[Train] [Epoch 1/20] | | 8/8900 [10:58<141:30:15, 57.29s/batch, transducer_loss=1505.1484]
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:04.0 Off | 0 |
| N/A 39C P0 94W / 300W | 15626MiB / 16130MiB | 15% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:05.0 Off | 0 |
| N/A 41C P0 69W / 300W | 15626MiB / 16130MiB | 21% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:06.0 Off | 0 |
| N/A 39C P0 66W / 300W | 15626MiB / 16130MiB | 32% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:07.0 Off | 0 |
| N/A 40C P0 60W / 300W | 15626MiB / 16130MiB | 25% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15032 C python 15615MiB |
| 1 15032 C python 15615MiB |
| 2 15032 C python 15615MiB |
| 3 15032 C python 15615MiB |
+-----------------------------------------------------------------------------+
@tann9949 So the problem is not the dataset; it might be TensorFlow and this V100 GPU.
@tann9949 How is the CPU utilization? Is it 100%?
I've used 8 vCPUs and 30 GB of memory. Each CPU core was at around 40-50% usage. I'm not sure whether the bottleneck is feature extraction, but from what I've tried, using TFRecords speeds things up the most (from ~2.5 s/batch to ~1.4 s/batch). Using --mxp and --cache doesn't help that much.
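For reference, the kind of input pipeline I mean looks roughly like this (parse_example and the file path are hypothetical stand-ins, not TensorFlowASR's actual loader):

import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def parse_example(serialized):
    # Hypothetical parser: decode one serialized record; the real loader
    # extracts log-mel features and label sequences instead.
    return tf.io.parse_single_example(
        serialized, {"audio": tf.io.FixedLenFeature([], tf.string)})

ds = tf.data.TFRecordDataset(["/path/to/tfrecords/train.tfrecord"])  # example path
ds = ds.map(parse_example, num_parallel_calls=AUTOTUNE)  # parallelize CPU-side parsing
ds = ds.cache()                                          # avoid re-decoding every epoch
ds = ds.prefetch(AUTOTUNE)                               # overlap CPU prep with GPU compute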
I think you can try to reproduce my error by training on LibriSpeech on a Google Cloud VM using the pytorch-1-4-cu101 image.
@tann9949 Please use https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras to trace what the reason is :)))
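For a custom training loop, something like this captures a trace (a minimal sketch; the dummy matmul stands in for your real training step):

import tensorflow as tf

def train_step():
    # Stand-in for the real training step, just to have GPU work to trace.
    a = tf.random.normal([512, 512])
    tf.linalg.matmul(a, a)

tf.profiler.experimental.start("/tmp/tb_logdir")
for step in range(20):
    with tf.profiler.experimental.Trace("train", step_num=step, _r=1):
        train_step()
tf.profiler.experimental.stop()
# Then run: tensorboard --logdir /tmp/tb_logdir and open the Profile tab
# to see whether time goes to the GPU kernels or to the input pipeline.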
I'm also suffering from low GPU utilization, even with a single GPU. See the graph below. Details:
- TensorFlowASR v0.7.1
- train_ga_conformer
- config: see below
If I try to increase the batch size, it fails instantly with OOM, so this is the best I could get.
Config:
speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  vocabulary: vocabularies/lithuanian.characters
  target_vocab_size: 4096
  max_subword_length: 4
  blank_at_zero: True
  beam_width: 5
  norm_score: True

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid_concat
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: False
  prediction_projection_units: 0
  joint_dim: 640
  joint_activation: tanh

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      after:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /tf_asr/manifests/cc_manifest_train.tsv
    tfrecords_dir: /tf_asr/tfrecords/cc/tfrecords-train
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True
  eval_dataset_config:
    use_tf: True
    data_paths:
      - /tf_asr/manifests/cc_manifest_eval.tsv
    tfrecords_dir: /tf_asr/tfrecords/cc/tfrecords-eval
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
  test_dataset_config:
    use_tf: True
    data_paths:
      - /tf_asr/manifests/cc_manifest_test.tsv
    tfrecords_dir: /tf_asr/tfrecords/cc/tfrecords-test
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
  optimizer_config:
    warmup_steps: 40000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9
  running_config:
    batch_size: 8
    accumulation_steps: 16
    num_epochs: 20
    outdir: /tf_asr/models
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000
    checkpoint:
      filepath: /tf_asr/models/checkpoints/{epoch:02d}.h5
      save_best_only: True
      save_weights_only: False
      save_freq: epoch
    states_dir: /tf_asr/models/states
    tensorboard:
      log_dir: /tf_asr/models/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: 'epoch'
      profile_batch: 2
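(For context on the OOM above: with batch_size: 8 and accumulation_steps: 16, one optimizer update effectively covers 8 × 16 = 128 utterances, but GPU memory is bounded only by the per-forward-pass batch of 8, which is why raising batch_size itself triggers the OOM.)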

I've been testing some models and I see that models without an RNN reach around 90-100% GPU utilization, while models with at least one RNN get around 25-70%. Do you guys have any idea how to improve GPU utilization for RNNs?
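(One thing I suspect is whether the LSTM actually hits the fused cuDNN kernel. In tf.keras the defaults are cuDNN-eligible, and deviating from them silently falls back to a much slower generic kernel, e.g.:)

import tensorflow as tf

# tf.keras.layers.LSTM only uses the fused cuDNN kernel on GPU when its
# arguments match the defaults (activation='tanh',
# recurrent_activation='sigmoid', use_bias=True, recurrent_dropout=0,
# unroll=False); otherwise it falls back to the generic implementation.
fast = tf.keras.layers.LSTM(320)                               # cuDNN-eligible
slow = tf.keras.layers.LSTM(320, recurrent_activation="relu")  # generic fallback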
I also suffered from low and only occasional GPU utilization. Switching to Keras helped.
Do you guys still have this issue?
I solved the problem by installing the CUDA and cuDNN libraries system-wide (via sudo) on Linux; for some reason, this problem happens when installing them through Anaconda.
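If anyone else hits this, a quick sanity check that TensorFlow actually loaded the GPU build of CUDA/cuDNN:

import tensorflow as tf

# An empty list here means TensorFlow could not load CUDA/cuDNN and is
# silently running on CPU, which shows up as ~0% GPU-Util in nvidia-smi.
print(tf.config.list_physical_devices("GPU"))
print("Built with CUDA:", tf.test.is_built_with_cuda())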