
Add support for NeMo Run to ASR


What does this PR do?

Adds NeMo Run support to ASR, and adds shared utilities for NeMo Run to the Common collection.

Collection: [ASR, Common]

Changelog

  • Add specific line-by-line info of the high-level changes in this PR.

Usage

Local Execution

conf/run_local.yaml

# The script to be run.
script: ???
script_config: ???

exp_name: null  # populated by exp_manager.name if not provided
results_dir: ???  # Where to store the results of the run

num_runs: 1
num_tasks_per_node: 1

########################################################################################################################

executor: local

containers:
  asr: nvcr.io/nvidia/nemo:24.07  # or nvcr.io/nvidia/nemo:dev

mounts:
  - "~/.cache/torch/NeMo:/cache/torch/NeMo"  # To mount your nemo cache dir (if needed for pretrained models)

Call run_helper.py

python run_helper.py --config-path "conf" --config-name "run_local.yaml" \
  script=asr_ctc/speech_to_text_ctc_bpe.py \
  script_config=conf/conformer/conformer_ctc_bpe.yaml \
  results_dir=$PWD/results \
  ++model.train_ds.manifest_filepath=/manifests/train_clean_5.json \
  ++model.validation_ds.manifest_filepath=/manifests/dev_clean_2.json \
  ++model.tokenizer.dir=/manifests/librispeech_tokenizer_spe_unigram_v1024 \
  ++mount_1="<Path to Manifests>/librispeech/manifests:/manifests" \
  ++mount_2="<Data Path>:/data"
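
The ++mount_1 and ++mount_2 overrides appear to append extra "host_path:container_path" pairs to the mounts list from run_local.yaml; the container-side paths (/manifests, /data) are what the model overrides above refer to. A minimal sketch with hypothetical host paths:

# Hypothetical host-side paths; replace with your own.
MANIFESTS_DIR=$HOME/datasets/librispeech/manifests
DATA_DIR=$HOME/datasets/librispeech

# Model/tokenizer overrides omitted here; they are the same as above.
python run_helper.py --config-path "conf" --config-name "run_local.yaml" \
  script=asr_ctc/speech_to_text_ctc_bpe.py \
  script_config=conf/conformer/conformer_ctc_bpe.yaml \
  results_dir=$PWD/results \
  ++mount_1="$MANIFESTS_DIR:/manifests" \
  ++mount_2="$DATA_DIR:/data"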

Cluster Execution

conf/run_slurm.yaml

# The script to be run.
script: ???
script_config: ???

exp_name: null  # populated by exp_manager.name if not provided
results_dir: ???  # Where to store the results of the run

# Optional arguments
num_runs: 1
num_tasks_per_node: 8
max_runtime: "00:03:45:00"

########################################################################################################################

executor: slurm

ssh_tunnel:
  host: <CLUSTER HOST>
  # ------------------------------- Fill this up! -------------------------------
  user: "${USER}"  # your username; or resolved from ${USER} environment variable 
  job_dir: <DIRECTORY TO STORE NEMO RUN JOB INFO>
  identity: "${CLUSTER_SSH_IDENTITY}"
  # -----------------------------------------------------------------------------

account: <SLURM ACCOUNT>
partition: <SLURM PARTITIONS>
job_name_prefix: <JOB PREFIX NAMES>

containers:
  asr: <CONTAINER NAME>

# These env vars are propagated to slurm runtime
env_vars:
  - 'TOKENIZERS_PARALLELISM=false'
  - 'LHOTSE_AUDIO_DURATION_MISMATCH_TOLERANCE=0.3'
  - 'TORCH_CUDNN_V8_API_ENABLED=1'
  - 'PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True'
  - 'HYDRA_FULL_ERROR=1'

# These env vars are required to be set in the launching environment and are
# propagated to the slurm runtime (see the sketch after this config)
required_env_vars:
  - 'HF_TOKEN'

mounts:
  # Replace with your own paths in your cluster config
  - <DATA PATH>:/data
  - <CHECKPOINTS PATH>:/asr_checkpoints

timeouts:
  interactive: 04:00:00

########################################################################################################################
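
Since HF_TOKEN is listed under required_env_vars, it presumably must already be set in the shell that launches run_helper.py so it can be forwarded to the job; a minimal sketch (the token value is a placeholder):

# Must exist in the launching environment before invoking run_helper.py.
export HF_TOKEN=<your Hugging Face token>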

IMPORTANT NOTE

Be very careful when using ${} syntax inside your hydra overrides. With double quotes ("), the shell expands ${...} from your shell/environment variables before hydra ever sees the override. To pass "hydra placeholders" through unresolved, use SINGLE QUOTES (') as shown below for ++name and results_dir.
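
A quick shell illustration of the difference (exp_name here is just a shell variable for demonstration):

exp_name=my_run

# Double quotes: the shell expands ${exp_name} before hydra sees the override.
echo "results_dir=/results/${exp_name}"   # -> results_dir=/results/my_run

# Single quotes: the literal text reaches hydra, which resolves ${exp_name}
# against the config at load time.
echo 'results_dir=/results/${exp_name}'   # -> results_dir=/results/${exp_name}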

Call run_helper.py

python run_helper.py --config-path conf/ --config-name run_slurm \
  script=speech_multitask/speech_to_text_aed.py \
  script_config=conf/aed_config.yaml \
  exp_name=<JOB NAME> \
  results_dir='/results/${exp_name}' \
  num_runs=2 \
  ++trainer.num_nodes=2 \
  ++name='${exp_name}' \
  ++exp_manager.wandb_logger_kwargs.project="nemo_asr" \
  ++USER=$USER \
  ++CLUSTER_SSH_IDENTITY=$CLUSTER_SSH_IDENTITY
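
With these overrides, the single-quoted '${exp_name}' placeholders are resolved by hydra rather than the shell, so results_dir and ++name presumably expand to /results/<JOB NAME> and <JOB NAME> at config load time. Likewise, ++USER=$USER and ++CLUSTER_SSH_IDENTITY=$CLUSTER_SSH_IDENTITY inject the shell values so the ${USER} and ${CLUSTER_SSH_IDENTITY} interpolations in the ssh_tunnel section can resolve.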

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR. To re-run CI, remove and re-add the label. To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • [x] Make sure you have read and followed the Contributor guidelines
  • [ ] Did you write any new necessary tests?
  • [ ] Did you add or update any necessary documentation?
  • [ ] Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
    • [ ] Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [x] New Feature
  • [ ] Bugfix
  • [ ] Documentation

If you haven't finished some of the above items you can still open a "Draft" PR.
