Add support for NeMo Run to ASR
What does this PR do?
Adds NeMo Run support to the ASR collection and adds shared NeMo Run utilities to the common collection.
Collection: [ASR, Common]
Changelog
- Add specific line by line info of high level changes in this PR.
Usage
Local Execution
conf/run_local.yaml
# The script to be run.
script: ???
script_config: ???
exp_name: null # populated by exp_manager.name if not provided
results_dir: ??? # Where to store the results of the run
num_runs: 1
num_tasks_per_node: 1
########################################################################################################################
executor: local
containers:
  asr: nvcr.io/nvidia/nemo:24.07 # or nvcr.io/nvidia/nemo:dev
mounts:
  - "~/.cache/torch/NeMo:/cache/torch/NeMo" # To mount your nemo cache dir (if needed for pretrained models)
Call run_helper.py
python run_helper.py --config-path "conf" --config-name "run_local.yaml" \
script=asr_ctc/speech_to_text_ctc_bpe.py \
script_config=conf/conformer/conformer_ctc_bpe.yaml \
results_dir=$PWD/results \
++model.train_ds.manifest_filepath=/manifests/train_clean_5.json \
++model.validation_ds.manifest_filepath="/manifests/dev_clean_2.json" \
++model.tokenizer.dir=/manifests/librispeech_tokenizer_spe_unigram_v1024 \
++mount_1="<Path to Manifests>/librispeech/manifests:/manifests" \
++mount_2="<Data Path>:/data"
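Note that the manifest and tokenizer overrides above use container paths (/manifests/...), not host paths. To confirm a file exists on the host before launching, a container path can be mapped back through its mount string. This is a minimal sketch, assuming a POSIX shell and a plain "host:container" mount without extra options. `to_host_path` is a hypothetical helper, not part of NeMo, and /home/me/... is a made-up example path.

```shell
# Hypothetical helper (not part of NeMo): translate a container path back
# to the host path for one "host_dir:container_dir" mount string.
to_host_path() {
    mount=$1
    container_path=$2
    host_dir=${mount%%:*}        # text before the first colon
    container_dir=${mount#*:}    # text after the first colon
    case "$container_path" in
        "$container_dir"/*)
            # Strip the container prefix and re-root the remainder on the host.
            rest=${container_path#"$container_dir"}
            printf '%s\n' "$host_dir$rest"
            ;;
        *)
            # The path is not covered by this mount.
            return 1
            ;;
    esac
}

# Example with a made-up host directory.
to_host_path "/home/me/librispeech/manifests:/manifests" /manifests/train_clean_5.json
# prints: /home/me/librispeech/manifests/train_clean_5.json
```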
Cluster Execution
conf/run_slurm.yaml
# The script to be run.
script: ???
script_config: ???
exp_name: null # populated by exp_manager.name if not provided
results_dir: ??? # Where to store the results of the run
# Optional arguments
num_runs: 1
num_tasks_per_node: 8
max_runtime: "00:03:45:00"
########################################################################################################################
executor: slurm
ssh_tunnel:
  host: <CLUSTER HOST>
  # ------------------------------- Fill this up! -------------------------------
  user: "${USER}" # your username; or resolved from ${USER} environment variable
  job_dir: <DIRECTORY TO STORE NEMO RUN JOB INFO>
  identity: "${CLUSTER_SSH_IDENTITY}"
  # -----------------------------------------------------------------------------
account: <SLURM ACCOUNT>
partition: <SLURM PARTITIONS>
job_name_prefix: <JOB PREFIX NAMES>
containers:
  asr: <CONTAINER NAME>
# These env vars are propagated to slurm runtime
env_vars:
- 'TOKENIZERS_PARALLELISM=false'
- 'LHOTSE_AUDIO_DURATION_MISMATCH_TOLERANCE=0.3'
- 'TORCH_CUDNN_V8_API_ENABLED=1'
- 'PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True'
- 'HYDRA_FULL_ERROR=1'
# These env vars must be set before launching and are propagated to slurm runtime
required_env_vars:
- 'HF_TOKEN'
mounts:
# Replace with your own paths in your cluster config
- <DATA PATH>:/data
- <CHECKPOINTS PATH>:/asr_checkpoints
timeouts:
  interactive: 04:00:00
########################################################################################################################
IMPORTANT NOTE
Be very careful with ${} syntax inside your Hydra overrides. If you wrap an override in double quotes ("), your shell expands ${...} from your environment variables before Hydra sees it. To pass a Hydra placeholder through unchanged, use SINGLE QUOTES ('), as done for results_dir and ++name in the run_helper.py call below.
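The quoting difference can be checked in a shell before launching anything. The sketch below assumes a POSIX shell. The variable names `double` and `single` and the value `value_from_shell` are illustrative only.

```shell
# Pretend an unrelated variable with the same name exists in the shell.
exp_name="value_from_shell"

# Double quotes: the shell expands ${exp_name} before Python ever starts,
# so Hydra receives the shell's value instead of a placeholder.
double="results_dir=/results/${exp_name}"

# Single quotes: the text is passed through literally, so the placeholder
# reaches Hydra unchanged.
single='results_dir=/results/${exp_name}'

printf '%s\n' "$double"   # prints: results_dir=/results/value_from_shell
printf '%s\n' "$single"   # prints: results_dir=/results/${exp_name}
```

If the variable is not set in the shell at all, the double-quoted form silently becomes results_dir=/results/ instead.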
Call run_helper.py
python run_helper.py --config-path conf/ --config-name run_slurm \
script=speech_multitask/speech_to_text_aed.py \
script_config=conf/aed_config.yaml \
exp_name=<JOB NAME> \
results_dir='/results/${exp_name}' \
num_runs=2 \
++trainer.num_nodes=2 \
++name='${exp_name}' \
++exp_manager.wandb_logger_kwargs.project="nemo_asr" \
++USER=$USER \
++CLUSTER_SSH_IDENTITY=$CLUSTER_SSH_IDENTITY
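The slurm config in this description reads USER and CLUSTER_SSH_IDENTITY and lists HF_TOKEN under required_env_vars, so it can help to confirm all three are set before calling run_helper.py. This is a minimal sketch, assuming a POSIX shell. `check_env` is a hypothetical helper, not part of NeMo or NeMo Run.

```shell
# Hypothetical helper (not part of NeMo): report every named variable
# that is unset or empty, and return non-zero if any is missing.
check_env() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Variables referenced by conf/run_slurm.yaml in this description.
check_env USER CLUSTER_SSH_IDENTITY HF_TOKEN \
    || echo "set the variables listed above before launching" >&2
```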
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR. To re-run CI, remove the label and add it again. To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
- [x] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you add or update any necessary documentation?
- [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [x] New Feature
- [ ] Bugfix
- [ ] Documentation
If you haven't finished some of the above items, you can still open a "Draft" PR.