
Dgalvez/cuda graphs greedy rnnt inference squash

Open galv opened this pull request 2 years ago • 9 comments

What does this PR do ?

Speeds up RNN-T greedy decoding greatly by eliminating the roughly 90% of decoding time during which the GPU sits idle waiting on the CPU, using CUDA graphs with conditional nodes.
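(Editor's note: for reviewers unfamiliar with the technique, here is a minimal, generic sketch of why graph replay removes CPU launch overhead. This is plain torch.cuda.CUDAGraph usage, not this PR's implementation, which additionally uses conditional nodes via cuda-python to keep the whole greedy-decoding loop on the GPU.)

import torch

# Generic capture/replay pattern: many small kernel launches are captured once,
# then replayed with a single CPU-side call, so the GPU no longer waits on the
# CPU between launches.
g = torch.cuda.CUDAGraph()
x = torch.zeros(16, 256, device="cuda")

# Warm up on a side stream before capture, as recommended by the PyTorch docs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    y = (x @ x.T).relu()
torch.cuda.current_stream().wait_stream(s)

with torch.cuda.graph(g):
    y = (x @ x.T).relu()  # captured work; launch overhead is paid once

x.copy_(torch.randn_like(x))  # update inputs in place ...
g.replay()                    # ... and replay the captured sequence; y is refreshed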

This is a squashed version of #7976. I would have squashed that PR in place, but I reference a few of its commits in bugs I filed and didn't want those links to break.

Collection: ASR. Also adds some cuda-python utilities to the common collection.

Changelog

  • Adds a new class RNNTGreedyDecodeFast, which uses cuda graphs with conditional nodes to remove the CPU overhead.

Usage

  • Example invocation:
python examples/asr/speech_to_text_eval.py \
    pretrained_name=stt_en_fastconformer_transducer_xlarge \
    dataset_manifest=/home/dgalvez/scratch/data/test_other.json \
    batch_size=16 \
    output_filename=test_other_decoded.jsonl \
    amp=true \
    amp_dtype=bfloat16 \
    rnnt_decoding.greedy.loop_labels=false \
    ++rnnt_decoding.greedy.go_very_fast=true \
    use_cer=false \
    num_workers=1

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

  • [ x ] Make sure you read and followed Contributor guidelines
  • [ x ] Did you write any new necessary tests?
  • [ ] Did you add or update any necessary documentation?
  • [ x ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • [ ] Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [ x ] New Feature
  • [ ] Bugfix
  • [ ] Documentation

Who can review?

Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to #7976

galv avatar Jan 17 '24 19:01 galv

jenkins

nithinraok avatar Jan 18 '24 14:01 nithinraok

@titu1994 for RNNT review. @galv, instead of go_very_fast, can you change it to something like enable_cuda_graphs=true?

nithinraok avatar Jan 18 '24 14:01 nithinraok

This PR is ready for review. @artbataev @titu1994 would you be willing?

galv avatar Jan 30 '24 00:01 galv

I can review this tomorrow. @artbataev, could you review it too?

titu1994 avatar Jan 30 '24 00:01 titu1994

Cool, I will review the PR today or tomorrow.

@galv Please fix the DCO check (you will need to fix it for merging anyway).

artbataev avatar Jan 30 '24 08:01 artbataev

Jenkins

titu1994 avatar Jan 30 '24 09:01 titu1994

jenkins

artbataev avatar Feb 15 '24 15:02 artbataev

jenkins

artbataev avatar Feb 16 '24 10:02 artbataev

jenkins

tbartley94 avatar Feb 16 '24 19:02 tbartley94

@artbataev Let me know if you think this is good to go.

galv avatar Feb 21 '24 00:02 galv

@galv As far as I can see, the issue can be fixed by passing the appropriate device to the CUDA stream initializers and getters:

def with_conditional_node(while_loop_kernel, while_loop_args, while_loop_conditional_handle, device):
    ...
    body_stream = torch.cuda.Stream(device=device)
    previous_stream = torch.cuda.current_stream(device=device)

def _reinitialize(...):
    ...
    with torch.cuda.stream(torch.cuda.Stream(device=device)), torch.inference_mode(), torch.cuda.graph(self.graph):
        ...
        capture_status, _, graph, _, _ = cu_call(
            cudart.cudaStreamGetCaptureInfo(torch.cuda.current_stream(device=device).cuda_stream)
        )

artbataev avatar Feb 21 '24 13:02 artbataev

@artbataev thank you for the initial suggestion. It works when the decoder has not been run yet. However, it doesn't work if the decoder has already been run. You can see my failing test here: https://github.com/NVIDIA/NeMo/commit/36b32738edccfda31b480f96e5b86ce28a6850d8

Clearly something obscure is happening here. The commit message provides more details. I've spent a few hours trying to debug this so I need to stop for the day.

galv avatar Feb 22 '24 05:02 galv

@galv I tried some changes, and it seems I can get it to work. But I'm wondering why these changes are required, and why everything works when creating a graph for the first time on any device. There may be some bugs in PyTorch.

  1. Pass the stream explicitly to torch.cuda.graph. After this change, I'm able to run the test, but the final comparison of the results fails (the results seem to be incorrect for the second run).
  2. Pass the device explicitly to all torch.cuda.current_stream calls. After this change, the test passes.
# Always create a new stream, because the per-thread default stream disallows stream capture to a graph.
stream_for_graph = torch.cuda.Stream(self.device)
with torch.cuda.stream(stream_for_graph), torch.inference_mode(), torch.cuda.graph(self.graph, stream=stream_for_graph):
    ...  # capture graph
    # pass device explicitly
    capture_status, _, graph, _, _ = cu_call(
        cudart.cudaStreamGetCaptureInfo(torch.cuda.current_stream(device=self.device).cuda_stream)
    )
    ...

@contextlib.contextmanager
def with_conditional_node(while_loop_kernel, while_loop_args, while_loop_conditional_handle, device):
    ...
    # pass device explicitly here and in other calls
    capture_status, _, graph, _, _ = cu_call(cudart.cudaStreamGetCaptureInfo(torch.cuda.current_stream(device=device).cuda_stream))
    ... 

You can see the full commit here: https://github.com/artbataev/NeMo/commit/77fc36e199d5c804b447cef0c817340ef4e0334c
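(Editor's note: a hypothetical two-GPU illustration, not from the PR or the linked commit, of the pitfall the explicit device argument guards against: torch.cuda.current_stream() without an argument queries the current device, which need not be the device the capture stream lives on.)

import torch

# Assumes at least two GPUs; the names below are illustrative only.
dev1 = torch.device("cuda:1")
s1 = torch.cuda.Stream(device=dev1)
with torch.cuda.stream(s1):
    # The current device is still cuda:0, so this returns cuda:0's stream ...
    print(torch.cuda.current_stream())
    # ... while the stream made current on cuda:1 must be asked for explicitly.
    print(torch.cuda.current_stream(device=dev1))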

artbataev avatar Feb 22 '24 12:02 artbataev

jenkins

galv avatar Feb 22 '24 19:02 galv

jenkins

galv avatar Feb 22 '24 21:02 galv

jenkins

The previous failure seems to be spurious, caused by the git clone step failing.

@artbataev I incorporated your change after verifying it on a multi-GPU machine. Thank you again. I also made one more commit, https://github.com/NVIDIA/NeMo/pull/8191/commits/fb2bd7aeb6a99540b7ca1d2334e1502ad989b8a7, which makes this work with cuda-python 12.4.0 and greater. It turns out that the bug fix in that version makes the phGraph_out variable non-writable, so I must not use my workaround when cuda-python > 12.3.0. Things are well tested at this point.
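(Editor's note: a hypothetical sketch of the kind of version gate described above; the helper name is made up and the PR's actual code may differ.)

from importlib.metadata import version
from packaging.version import Version

def _use_phgraph_out_workaround() -> bool:
    # Apply the workaround only for cuda-python <= 12.3.0; the fix in 12.4.0
    # makes the phGraph_out out-parameter non-writable, so skip it there.
    return Version(version("cuda-python")) <= Version("12.3.0")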

galv avatar Feb 23 '24 06:02 galv

@galv I manually restarted Jenkins, but it is still waiting for an executor

artbataev avatar Feb 23 '24 12:02 artbataev

@galv Please fix the test failing on Jenkins (the guard is needed):

FAILED tests/collections/asr/decoding/test_cuda_graph_rnnt_greedy_decoding.py::test_change_devices - ImportError: Found cuda-python 12.3.0rc4+8.gcb4e395, but at least version 12.3.0 is needed.
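(Editor's note: a hypothetical sketch of the kind of guard meant here, skipping the test rather than raising ImportError when a suitable cuda-python is unavailable; the helper is illustrative, not the PR's exact code.)

import pytest
from importlib.metadata import PackageNotFoundError, version
from packaging.version import Version

def require_cuda_python(min_version: str = "12.3.0"):
    # Skip, rather than fail, when cuda-python is missing or too old.
    try:
        if Version(version("cuda-python")) < Version(min_version):
            pytest.skip(f"cuda-python >= {min_version} is required")
    except PackageNotFoundError:
        pytest.skip("cuda-python is not installed")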

artbataev avatar Feb 23 '24 13:02 artbataev

jenkins

Sorry for missing the guard in that test. Hopefully things go through now.

galv avatar Feb 23 '24 19:02 galv

jenkins

artbataev avatar Feb 24 '24 14:02 artbataev