Dgalvez/cuda graphs greedy rnnt inference squash
What does this PR do?
Speeds up RNN-T greedy decoding greatly by using CUDA graphs with conditional nodes to eliminate the roughly 90% of decoding time during which the GPU sits idle, waiting on the CPU.
This is a squashed version of #7976. I wanted to squash that PR directly, but I reference a few of its commits in bugs I filed and did not want those links to break.
Collection: ASR. Also adds some cuda-python utilities to common.
Changelog
- Adds a new class `RNNTGreedyDecodeFast`, which uses CUDA graphs with conditional nodes to remove the CPU overhead.
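For background, CUDA graphs let a whole sequence of kernels be replayed with a single launch, which is what removes the CPU-side overhead. Below is a minimal sketch of plain graph capture using PyTorch's public API, for illustration only; the decoder in this PR uses the lower-level cuda-python driver API so it can add conditional while-loop nodes, which plain capture cannot express:

```python
import torch

# Capture a fixed sequence of GPU work into a CUDA graph once, then
# replay it with a single launch, avoiding per-kernel CPU launch latency.
device = torch.device("cuda")
static_input = torch.randn(16, 640, device=device)
weight = torch.randn(640, 640, device=device)

# Warm up on a side stream so capture sees fully initialized state.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    static_output = static_input @ weight
torch.cuda.current_stream().wait_stream(side_stream)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = static_input @ weight

# Replay: write new data into the captured input buffer, then launch once.
static_input.copy_(torch.randn(16, 640, device=device))
graph.replay()
torch.cuda.synchronize()
```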
Usage
```bash
python examples/asr/speech_to_text_eval.py \
    pretrained_name=stt_en_fastconformer_transducer_xlarge \
    dataset_manifest=/home/dgalvez/scratch/data/test_other.json \
    batch_size=16 \
    output_filename=test_other_decoded.jsonl \
    amp=true \
    amp_dtype=bfloat16 \
    rnnt_decoding.greedy.loop_labels=false \
    ++rnnt_decoding.greedy.go_very_fast=true \
    use_cer=false \
    num_workers=1
```
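The same decoding configuration can also be set from Python. This is a hedged sketch: the config keys mirror the command-line overrides above, `audio.wav` is a placeholder path, and the `go_very_fast` flag name is under discussion below:

```python
import copy

from omegaconf import open_dict

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_xlarge")

# Mirror rnnt_decoding.greedy.loop_labels=false and
# ++rnnt_decoding.greedy.go_very_fast=true from the CLI invocation above.
decoding_cfg = copy.deepcopy(model.cfg.decoding)
with open_dict(decoding_cfg):
    decoding_cfg.greedy.loop_labels = False
    decoding_cfg.greedy.go_very_fast = True  # flag name may be renamed per review
model.change_decoding_strategy(decoding_cfg)

# "audio.wav" is a placeholder; pass your own audio files.
hypotheses = model.transcribe(["audio.wav"], batch_size=16)
```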
Jenkins CI
To run Jenkins, a NeMo User with write access must comment `jenkins` on the PR.
Before your PR is "Ready for review"
Pre checks:
- [x] Make sure you read and followed Contributor guidelines
- [x] Did you write any new necessary tests?
- [ ] Did you add or update any necessary documentation?
- [x] Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [x] New Feature
- [ ] Bugfix
- [ ] Documentation
Who can review?
Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information
- Related to #7976
jenkins
@titu1994 for RNNT review.
@galv instead of go_very_fast, can you change it to something like enable_cuda_graphs=true?
This PR is ready for review. @artbataev @titu1994 would you be willing?
I can review this tomorrow. @artbataev, could you review it too?
Cool, I will review the PR today or tomorrow.
@galv Please fix DCO (anyway, you will need to fix it for merging).
Jenkins
jenkins
jenkins
jenkins
@artbataev Let me know if you think this is good to go.
@galv As far as I can see, the issue can be fixed by passing the appropriate device to the CUDA stream initializers and getters:
```python
def with_conditional_node(while_loop_kernel, while_loop_args, while_loop_conditional_handle, device):
    ...
    body_stream = torch.cuda.Stream(device=device)
    previous_stream = torch.cuda.current_stream(device=device)

def _reinitialize(...):
    ...
    with torch.cuda.stream(torch.cuda.Stream(device=device)), torch.inference_mode(), torch.cuda.graph(self.graph):
        ...
        capture_status, _, graph, _, _ = cu_call(
            cudart.cudaStreamGetCaptureInfo(torch.cuda.current_stream(device=device).cuda_stream)
        )
```
@artbataev thank you for the initial suggestion. It works when the decoder has not been run yet, but it doesn't work if the decoder has already been run. You can see my failing test here: https://github.com/NVIDIA/NeMo/commit/36b32738edccfda31b480f96e5b86ce28a6850d8
Clearly something obscure is happening here; the commit message provides more details. I've spent a few hours trying to debug this, so I need to stop for the day.
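For concreteness, the failing scenario is roughly the following (a hedged sketch, not the exact test in the linked commit; `decoder` stands for the CUDA-graph greedy decoder and `make_batch` for any input factory):

```python
import torch

def run_on_two_devices(decoder, make_batch):
    # First run: the CUDA graph is captured on cuda:0 and works fine.
    out0 = decoder(make_batch(torch.device("cuda:0")))
    # Move the decoder (and its captured graph state) to a second GPU.
    decoder.to("cuda:1")
    # Second run: this is where re-capture on the new device misbehaves.
    out1 = decoder(make_batch(torch.device("cuda:1")))
    return out0, out1
```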
@galv I tried some changes, and it seems I can get it to work. But I'm wondering why these changes are required, and why everything works when creating a graph for the first time on any device. There may be some bugs in PyTorch.
- Pass the stream explicitly to `torch.cuda.graph`. After this change, I'm able to run the test, but the final comparison of the results fails (the results seem to be incorrect for the second run).
- Pass the device explicitly to all calls to `torch.cuda.current_stream`. After this change, the test passes.
```python
# Always create a new stream, because the per-thread default stream disallows stream capture to a graph.
stream_for_graph = torch.cuda.Stream(self.device)
with torch.cuda.stream(stream_for_graph), torch.inference_mode(), torch.cuda.graph(self.graph, stream=stream_for_graph):
    ...  # capture graph

# pass device explicitly
capture_status, _, graph, _, _ = cu_call(
    cudart.cudaStreamGetCaptureInfo(torch.cuda.current_stream(device=self.device).cuda_stream)
)
...

@contextlib.contextmanager
def with_conditional_node(while_loop_kernel, while_loop_args, while_loop_conditional_handle, device):
    ...
    # pass device explicitly here and in other calls
    capture_status, _, graph, _, _ = cu_call(
        cudart.cudaStreamGetCaptureInfo(torch.cuda.current_stream(device=device).cuda_stream)
    )
    ...
```
You can see the full commit here: https://github.com/artbataev/NeMo/commit/77fc36e199d5c804b447cef0c817340ef4e0334c
jenkins
jenkins
jenkins
The previous failure seems to be spurious, caused by git clone failing.
@artbataev I incorporated your change after verifying it on a multi-GPU machine. Thank you again. I also made one more commit, https://github.com/NVIDIA/NeMo/pull/8191/commits/fb2bd7aeb6a99540b7ca1d2334e1502ad989b8a7, which makes this work with cuda-python version 12.4.0 and greater. It turns out that the bug fix in that version makes the phGraph_out variable not writable, so I must not use my workaround when cuda-python > 12.3.0. Things are well tested at this point.
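The version gate can be as simple as a comparison against the installed cuda-python version. A hedged sketch (variable names here are illustrative, not the exact code in the linked commit):

```python
from importlib.metadata import version

from packaging.version import Version

# The binding fix in cuda-python > 12.3.0 makes phGraph_out read-only,
# so the old workaround must only be applied on <= 12.3.0.
CUDA_PYTHON_VERSION = Version(version("cuda-python"))
USE_PHGRAPH_OUT_WORKAROUND = CUDA_PYTHON_VERSION <= Version("12.3.0")
```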
@galv I manually restarted Jenkins, but it is still waiting for an executor.
@galv please fix the test failing on Jenkins (a guard is needed):

```
FAILED tests/collections/asr/decoding/test_cuda_graph_rnnt_greedy_decoding.py::test_change_devices - ImportError: Found cuda-python 12.3.0rc4+8.gcb4e395, but at least version 12.3.0 is needed.
```
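A guard of the kind being requested might look like the following (a hedged sketch; the helper import path is assumed from this PR's cuda-python utils and may differ from the actual code):

```python
import pytest

# Skip, rather than fail, when cuda-python is missing or too old for
# conditional-node support. The helper import path below is assumed.
try:
    from nemo.core.utils.cuda_python_utils import (
        check_cuda_python_cuda_graphs_conditional_nodes_supported,
    )

    check_cuda_python_cuda_graphs_conditional_nodes_supported()
    HAVE_CONDITIONAL_NODES = True
except ImportError:
    HAVE_CONDITIONAL_NODES = False

@pytest.mark.skipif(
    not HAVE_CONDITIONAL_NODES,
    reason="Requires cuda-python >= 12.3.0 with conditional node support",
)
def test_change_devices():
    ...
```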
jenkins
Sorry for missing the guard in that test. Hopefully things go through now.
jenkins