Dgalvez/cuda graphs greedy rnnt inference squash
What does this PR do?
Speeds up RNN-T greedy decoding greatly by using CUDA graphs with conditional nodes to eliminate the roughly 90% of decoding time during which the GPU sits idle, waiting on the CPU.
This is a squashed version of #7976. I wanted to squash that PR directly, but I reference a few of its commits in bugs I filed and did not want those links to break.
Collection: ASR. Also adds some cuda-python utilities to common.
Changelog
- Adds a new class `RNNTGreedyDecodeFast`, which uses CUDA graphs with conditional nodes to remove the CPU overhead.
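For background, CUDA graphs let a whole sequence of kernels be replayed with a single launch, which is what removes the CPU-side overhead. Below is a minimal sketch of plain graph capture using PyTorch's public API, for illustration only; the decoder in this PR uses the lower-level cuda-python driver API so it can add conditional while-loop nodes, which plain capture cannot express:

```python
import torch

# Capture a fixed sequence of GPU work into a CUDA graph once, then
# replay it with a single launch, avoiding per-kernel CPU launch latency.
device = torch.device("cuda")
static_input = torch.randn(16, 640, device=device)
weight = torch.randn(640, 640, device=device)

# Warm up on a side stream so capture sees fully initialized state.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    static_output = static_input @ weight
torch.cuda.current_stream().wait_stream(side_stream)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = static_input @ weight

# Replay: write new data into the captured input buffer, then launch once.
static_input.copy_(torch.randn(16, 640, device=device))
graph.replay()
torch.cuda.synchronize()
```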
Usage
```bash
python examples/asr/speech_to_text_eval.py \
    pretrained_name=stt_en_fastconformer_transducer_xlarge \
    dataset_manifest=/home/dgalvez/scratch/data/test_other.json \
    batch_size=16 \
    output_filename=test_other_decoded.jsonl \
    amp=true \
    amp_dtype=bfloat16 \
    rnnt_decoding.greedy.loop_labels=false \
    ++rnnt_decoding.greedy.go_very_fast=true \
    use_cer=false \
    num_workers=1
```
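The same decoding configuration can also be set from Python. This is a hedged sketch: the config keys mirror the command-line overrides above, `audio.wav` is a placeholder path, and the `go_very_fast` flag name is under discussion below:

```python
import copy

from omegaconf import open_dict

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_xlarge")

# Mirror rnnt_decoding.greedy.loop_labels=false and
# ++rnnt_decoding.greedy.go_very_fast=true from the CLI invocation above.
decoding_cfg = copy.deepcopy(model.cfg.decoding)
with open_dict(decoding_cfg):
    decoding_cfg.greedy.loop_labels = False
    decoding_cfg.greedy.go_very_fast = True  # flag name may be renamed per review
model.change_decoding_strategy(decoding_cfg)

# "audio.wav" is a placeholder; pass your own audio files.
hypotheses = model.transcribe(["audio.wav"], batch_size=16)
```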
Jenkins CI
To run Jenkins, a NeMo User with write access must comment `jenkins` on the PR.
Before your PR is "Ready for review"
Pre checks:
- [x] Make sure you read and followed Contributor guidelines
- [x] Did you write any new necessary tests?
- [ ] Did you add or update any necessary documentation?
- [x] Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [x] New Feature
- [ ] Bugfix
- [ ] Documentation
Who can review?
Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information
- Related to #7976
jenkins
@titu1994 for RNNT review.
@galv instead of go_very_fast, can you change it to something like enable_cuda_graphs=true?
This PR is ready for review. @artbataev @titu1994 would you be willing?
I can review this tomorrow. @artbataev, could you review it too?
Cool, I will review the PR today or tomorrow.
@galv Please fix DCO (anyway, you will need to fix it for merging).
Jenkins
jenkins
jenkins
jenkins
@artbataev Let me know if you think this is good to go.
@galv As far as I can see, the issue can be fixed by passing the appropriate device to the CUDA stream initializers and getters:
```python
def with_conditional_node(while_loop_kernel, while_loop_args, while_loop_conditional_handle, device):
    ...
    body_stream = torch.cuda.Stream(device=device)
    previous_stream = torch.cuda.current_stream(device=device)

def _reinitialize(...):
    ...
    with torch.cuda.stream(torch.cuda.Stream(device=device)), torch.inference_mode(), torch.cuda.graph(self.graph):
        ...
        capture_status, _, graph, _, _ = cu_call(
            cudart.cudaStreamGetCaptureInfo(torch.cuda.current_stream(device=device).cuda_stream)
        )
```
@artbataev thank you for the initial suggestion. It works when the decoder has not been run yet, but it doesn't work if the decoder has already been run. You can see my failing test here: https://github.com/NVIDIA/NeMo/commit/36b32738edccfda31b480f96e5b86ce28a6850d8
Clearly something obscure is happening here; the commit message provides more details. I've spent a few hours trying to debug this, so I need to stop for the day.
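For concreteness, the failing scenario is roughly the following (a hedged sketch, not the exact test in the linked commit; `decoder` stands for the CUDA-graph greedy decoder and `make_batch` for any input factory):

```python
import torch

def run_on_two_devices(decoder, make_batch):
    # First run: the CUDA graph is captured on cuda:0 and works fine.
    out0 = decoder(make_batch(torch.device("cuda:0")))
    # Move the decoder (and its captured graph state) to a second GPU.
    decoder.to("cuda:1")
    # Second run: this is where re-capture on the new device misbehaves.
    out1 = decoder(make_batch(torch.device("cuda:1")))
    return out0, out1
```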
@galv I tried some changes, and it seems I can get it to work. But I'm wondering why these changes are required, and why everything works when creating a graph for the first time on any device. There may be some bugs in PyTorch.
- Pass the stream explicitly to `torch.cuda.graph`. After this change, I'm able to run the test, but the final comparison of the results fails (the results seem to be incorrect for the second run).
- Pass the device explicitly to all calls to `torch.cuda.current_stream`. After this change, the test passes.
```python
# Always create a new stream, because the per-thread default stream disallows stream capture to a graph.
stream_for_graph = torch.cuda.Stream(self.device)
with torch.cuda.stream(stream_for_graph), torch.inference_mode(), torch.cuda.graph(self.graph, stream=stream_for_graph):
    ...  # capture graph

# pass device explicitly
capture_status, _, graph, _, _ = cu_call(
    cudart.cudaStreamGetCaptureInfo(torch.cuda.current_stream(device=self.device).cuda_stream)
)
...

@contextlib.contextmanager
def with_conditional_node(while_loop_kernel, while_loop_args, while_loop_conditional_handle, device):
    ...
    # pass device explicitly here and in other calls
    capture_status, _, graph, _, _ = cu_call(
        cudart.cudaStreamGetCaptureInfo(torch.cuda.current_stream(device=device).cuda_stream)
    )
    ...
```
You can see the full commit here: https://github.com/artbataev/NeMo/commit/77fc36e199d5c804b447cef0c817340ef4e0334c
jenkins
jenkins
jenkins
The previous failure seems to be spurious, caused by git clone failing.
@artbataev I incorporated your change after verifying it on a multi-GPU machine. Thank you again. I also made one more commit, https://github.com/NVIDIA/NeMo/pull/8191/commits/fb2bd7aeb6a99540b7ca1d2334e1502ad989b8a7, which makes this work with cuda-python version 12.4.0 and greater. It turns out that the bug fix in that version makes the phGraph_out variable not writable, so I must not use my workaround when cuda-python > 12.3.0. Things are well tested at this point.
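The version gate can be as simple as a comparison against the installed cuda-python version. A hedged sketch (variable names here are illustrative, not the exact code in the linked commit):

```python
from importlib.metadata import version

from packaging.version import Version

# The binding fix in cuda-python > 12.3.0 makes phGraph_out read-only,
# so the old workaround must only be applied on <= 12.3.0.
CUDA_PYTHON_VERSION = Version(version("cuda-python"))
USE_PHGRAPH_OUT_WORKAROUND = CUDA_PYTHON_VERSION <= Version("12.3.0")
```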
@galv I manually restarted Jenkins, but it is still waiting for an executor.
@galv please fix the test failing on Jenkins (a guard is needed):

```
FAILED tests/collections/asr/decoding/test_cuda_graph_rnnt_greedy_decoding.py::test_change_devices - ImportError: Found cuda-python 12.3.0rc4+8.gcb4e395, but at least version 12.3.0 is needed.
```
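A guard of the kind being requested might look like the following (a hedged sketch; the helper import path is assumed from this PR's cuda-python utils and may differ from the actual code):

```python
import pytest

# Skip, rather than fail, when cuda-python is missing or too old for
# conditional-node support. The helper import path below is assumed.
try:
    from nemo.core.utils.cuda_python_utils import (
        check_cuda_python_cuda_graphs_conditional_nodes_supported,
    )

    check_cuda_python_cuda_graphs_conditional_nodes_supported()
    HAVE_CONDITIONAL_NODES = True
except ImportError:
    HAVE_CONDITIONAL_NODES = False

@pytest.mark.skipif(
    not HAVE_CONDITIONAL_NODES,
    reason="Requires cuda-python >= 12.3.0 with conditional node support",
)
def test_change_devices():
    ...
```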
jenkins
Sorry for missing the guard in that test. Hopefully things go through now.
jenkins