training Add MLCube support for RNN speech recognition

Used PR #465 as reference.

Current implementation

We'll be updating this section as we merge MLCube PRs and make new MLCube releases.

Project setup

# Create Python environment and install MLCube Docker runner 
virtualenv -p python3 ./env && source ./env/bin/activate && pip install mlcube-docker

# Fetch the RNN speech recognition workload
git clone https://github.com/mlcommons/training && cd ./training
git fetch origin pull/491/head:feature/rnnt_mlcube && git checkout feature/rnnt_mlcube
cd ./rnn_speech_recognition/mlcube

Dataset

The LibriSpeech dataset is downloaded, extracted, and preprocessed. Approximate dataset size after each step:

Dataset Step	MLCube Task	Format	Size
Download (Compressed dataset)	download_data	Tar files	~62 GB
Extract (Uncompressed dataset)	download_data	Flac files	~64 GB
Preprocess (Processed dataset)	preprocess_data	Wav files	~114 GB
Total	(After all tasks)	All	~240 GB
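Since the three steps add up to roughly 240 GB, it may be worth checking free disk space before starting the download. The following is a minimal sketch, not part of this PR: the helper names `available_gb` and `check_space` are hypothetical, and the default directory `.` is only a placeholder for wherever the data will actually live.

```shell
#!/bin/sh
# Sum of the sizes from the table above: compressed + extracted + processed (GB).
REQUIRED_GB=$((62 + 64 + 114))

# Hypothetical helper: print the free space, in whole GB, of the filesystem
# that holds the given path. `df -Pk` reports 1024-byte blocks in POSIX format;
# column 4 of the second line is the available space.
available_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Hypothetical helper: succeed if the path has at least the requested GB free.
check_space() {
    [ "$(available_gb "$1")" -ge "$2" ]
}

# Placeholder target: point TARGET_DIR at the directory that will hold the data.
TARGET_DIR="${TARGET_DIR:-.}"

if check_space "$TARGET_DIR" "$REQUIRED_GB"; then
    echo "Enough space in ${TARGET_DIR} for ~${REQUIRED_GB} GB"
else
    echo "Need ~${REQUIRED_GB} GB, found $(available_gb "$TARGET_DIR") GB in ${TARGET_DIR}" >&2
fi
```

The check looks at the filesystem that contains `TARGET_DIR`, so it should be run against the same mount point that the data directory will use.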

Tasks execution

# Download Librispeech dataset. Default path = ./workspace/data
# To override it, use data_dir=DATA_DIR
mlcube run --task download_data

# Preprocess Librispeech dataset; this converts .flac audio files to .wav format
# It uses the DATA_DIR path defined in the previous step
mlcube run --task preprocess_data

# Run benchmark. Default paths = ./workspace/data
# Parameters to override: data_dir=DATA_DIR, output_dir=OUTPUT_DIR, parameters_file=PATH_TO_TRAINING_PARAMS
mlcube run --task train
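
The comments above give the overrides as `key=value` pairs. As a rough sketch, assuming these pairs are appended to the `mlcube run` command line (this may differ between MLCube versions), an overridden train command could be assembled as follows. The paths and the `train_cmd` helper are hypothetical and only serve as illustration.

```shell
#!/bin/sh
# Hypothetical host directories; replace with your own paths.
DATA_DIR="${DATA_DIR:-$HOME/librispeech}"
OUTPUT_DIR="${OUTPUT_DIR:-$HOME/rnnt_output}"

# Hypothetical helper: print a train command with key=value overrides.
# Assumption: MLCube accepts the overrides as trailing key=value arguments.
train_cmd() {
    printf 'mlcube run --task train data_dir=%s output_dir=%s\n' "$1" "$2"
}

# Print the command for review instead of running it directly.
train_cmd "$DATA_DIR" "$OUTPUT_DIR"
```

Printing the command first makes it easy to confirm the paths before starting a long training run; the printed line can then be pasted into the shell.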

We are targeting pull-type installation, so MLCube images should be available on Docker Hub. If they are not, try building the image locally:

mlcube run ... -Pdocker.build_strategy=always

Jun 24 '21 22:06 davidjurado

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Jun 24 '21 22:06 github-actions[bot]

Hello @davidjurado! I tried to follow the task execution steps, but the last step failed with the following error:

$ mlcube run --task train
Usage: mlcube.py train [OPTIONS]
Try 'mlcube.py train --help' for help.

Error: Missing option '--output_dir'.
2023-05-19 09:35:17 [...]

Your description says:

# Run benchmark. Default paths = ./workspace/data
# Parameters to override: data_dir=DATA_DIR, output_dir=OUTPUT_DIR, parameters_file=PATH_TO_TRAINING_PARAMS
mlcube run --task train

How do I override output_dir?

May 19 '23 07:05 mwawrzos

@davidjurado, can you answer @mwawrzos's question? We can merge this accordingly.

Mar 08 '24 03:03 nv-rborkar