CANN Backend support
Introduction
CANN (Compute Architecture of Neural Networks), developed by Huawei, is a heterogeneous computing architecture for AI scenarios.
It provides multi-layer programming interfaces to help users quickly build AI applications and services based on the Ascend platform.
The CANN backend in CTranslate2 enables running AI models on Ascend NPUs, extending the existing CPU and CUDA workflows. More on the Ascend NPU and the CANN library can be found here.
Examples of projects that already support CANN include ONNX Runtime and OpenCV.
resolves #1609
Notes
- While developing this feature, we also submitted issue https://github.com/OpenNMT/CTranslate2/issues/1583 .
- If CANN backend support gains traction, a follow-up pull request/subproject will contribute the corresponding CI running on dedicated Ascend hardware.
Implementation
The CANN backend introduces Device::CANN, analogous to the existing CPU and CUDA devices.
The CANN workflow is enabled with -DWITH_CANN=ON in the CMake configuration (see examples/cann). As with CUDA, CANN can coexist alongside the CPU workflow.
The CANN workflow is accessible through the examples (examples/cann/main.cc), the CLI, or the Python module.
Operators and primitives were implemented for CANN so that the end-to-end example in the CTranslate2 documentation runs successfully.
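As a quick illustration of the new device from the Python side, here is a minimal sketch; it assumes the device string "cann" is accepted by analogy with "cpu" and "cuda", while device="auto" (used in the samples below) picks the NPU automatically when available.

import ctranslate2

# Assumption: "cann" is accepted as a device string, by analogy with "cpu" and "cuda";
# device="auto" (used in the samples below) selects the NPU automatically when available.
translator = ctranslate2.Translator("ende_ctranslate2/", device="cann", device_index=0)
results = translator.translate_batch([["▁H", "ello", "▁world", "!"]])
print(results[0].hypotheses[0])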
Tests
Tests were extended to cover Device::CANN and the respective DataType. Additional tests were implemented for extra/edge cases. Gtest output: gtest_cann.log
Environment Setup
- Download the CANN drivers by selecting the AArch64.run category (the current implementation was developed against CANN 7.0.RC1.alpha001).
- Build the image and run the container as in docker/cann.
For details on setting up the development and operating environments, see Development and Operating Environment Setup and the CANN Software Installation Guide.
Build CANN Python module
The CANN Python module is expected to be built using the respective Docker files. Nevertheless, here is a quick way to build it, convenient for testing and benchmarking.
#!/bin/bash
# execute from project root
rm -rf build-release/
mkdir build-release && cd build-release || exit
cmake -DWITH_CANN=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_CLI=OFF -DWITH_MKL=OFF -DOPENMP_RUNTIME=COMP -DCMAKE_PREFIX_PATH="/opt/OpenBLAS" -DWITH_OPENBLAS=ON -DWITH_RUY=ON ..
VERBOSE=1 make -j"$(nproc)" install && cd ..
export CIBW_ARCHS=aarch64
pip3 uninstall --yes ctranslate2
pip3 install -r python/install_requirements.txt
cd python && python3 setup.py bdist_wheel && cd ..
python3 -m pip install python/dist/ctranslate2*.whl
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
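After installing the wheel, a quick sanity check (a minimal sketch, assuming the CANN runtime environment from docker/cann is active) is to query the backend from Python:

# quick sanity check after installing the wheel
import ctranslate2

# compute types available on the NPU (cf. the sample output below)
print(ctranslate2.get_supported_compute_types("cann"))
# number of visible Ascend devices
print(ctranslate2.get_cann_device_count())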
Build CANN C++ example
#!/bin/bash
# execute from project root
# first build ct2lib
rm -rf build-release/
mkdir build-release && cd build-release || exit
cmake -DWITH_CANN=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_CLI=OFF -DWITH_MKL=OFF -DOPENMP_RUNTIME=COMP -DCMAKE_PREFIX_PATH="/opt/OpenBLAS" -DWITH_OPENBLAS=ON -DWITH_RUY=ON ..
make -j"$(nproc)"
rm CMakeCache.txt
# then build cann_run
cmake -DCMAKE_BUILD_TYPE=Release ../examples/cann/
make -j"$(nproc)"
# ./cann_run <ende_ctranslate2_path>
Samples
Python
import ctranslate2
print("get_supported_compute_types for cann: ", ctranslate2.get_supported_compute_types("cann"))
print("get_cann_device_count: ", ctranslate2.get_cann_device_count())
translator = ctranslate2.Translator("/ctranslate2_docs/ende_ctranslate2/", device="auto")
results = translator.translate_batch([["▁H", "ello", "▁world", "!"]])
output_tokens = results[0].hypotheses[0]
print(output_tokens)
> python3 ct2python_example.py
get_supported_compute_types for cann: {'int8_float16', 'int8_float32', 'int8', 'float32', 'bfloat16', 'int8_bfloat16', 'float16'}
get_cann_device_count: 8
['▁Hallo', '▁Welt', '!']
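Optionally, one of the compute types reported above can be requested explicitly. The sketch below relies on the standard compute_type argument of ctranslate2.Translator and picks float16, which appears in the supported list above:

import ctranslate2

# request float16 explicitly; it is listed by get_supported_compute_types("cann") above
translator = ctranslate2.Translator(
    "/ctranslate2_docs/ende_ctranslate2/",
    device="auto",
    compute_type="float16",
)
print(translator.translate_batch([["▁H", "ello", "▁world", "!"]])[0].hypotheses[0])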
C++
An execution example in C++ can be found in examples/cann.
CLI
echo "▁H ello ▁world !" | ./ct2-translator --model "./ende_ctranslate2/"
▁Hallo ▁Welt !
Benchmark
We conducted several runs measuring translation latency using all 192 CPU cores and 1 NPU device for a single batch. Specifically, the experiments report 4 consecutive runs for inputs of 4 and 306 tokens respectively. The NPU was faster in every case.
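For reference, the per-run latencies were collected with a simple timing loop; the sketch below is an illustrative approximation (it assumes the model path from the samples above and does not reproduce the exact CPU thread settings), and printing the datetime difference yields the h:mm:ss.microseconds values shown in the tables:

import datetime
import ctranslate2

batch_4 = [["▁H", "ello", "▁world", "!"]]  # the 4-token input shown below

for device in ("cpu", "auto"):  # "auto" selects the NPU when the module is built with CANN
    translator = ctranslate2.Translator("/ctranslate2_docs/ende_ctranslate2/", device=device)
    for run in range(1, 5):
        start = datetime.datetime.now()
        translator.translate_batch(batch_4)
        print(device, run, datetime.datetime.now() - start)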
Input tokens
4 tokens
{{"▁H", "ello", "▁world", "!"}}
306 tokens
{{"▁In", "▁this", "▁paper", ",", "▁we", "▁speed", "▁up", "▁the", "▁context", "▁extension", "▁of", "▁L", "LM", "s", ",", "▁in", "▁two", "▁aspects", ".", "▁Particularly", ",", "▁it", "▁can", "▁be", "▁implemented", "▁with", "▁only", "▁two", "▁lines", "▁of", "▁code", "▁in", "▁training", ",", "▁while", "▁being", "▁optional", "▁in", "▁in", "fer", "ence", ".", "▁Typical", "ly" , "▁training", "▁L", "LM", "s", "▁with", "▁long", "▁context", "▁sizes", "▁is" ,"▁comp", "ut", "ation", "ally", "▁expensive", "▁requiring", "▁extensive", "▁training", "▁hours", "▁and", "▁G", "PU", "▁resources", ".", "▁On", "▁the", "▁one", "▁hand", ",", "▁although", "▁den", "se", "▁global", "▁attention", "▁is", "▁needed", "▁during", "▁in", "fer", "ence", ",", "▁fine", "-", "tun", "ing", "▁the", "▁model", "▁can" ,"▁be", "▁effectively", "▁and", "▁efficiently", "▁done", "▁by", "▁spar", "se", "▁local", "▁attention", ".", "▁In", "▁this", "▁paper", ",", "▁we", "▁speed", "▁up", "▁the", "▁context", "▁extension", "▁of", "▁L", "LM", "s", ",", "▁in", "▁two", "▁aspects", ".", "▁Particularly", ",", "▁it", "▁can", "▁be", "▁implemented", "▁with", "▁only", "▁two", "▁lines", "▁of", "▁code", "▁in", "▁training", ",", "▁while", "▁being", "▁optional", "▁in", "▁in", "fer", "ence", ".", "▁Typical", "ly" , "▁training", "▁L", "LM", "s", "▁with", "▁long", "▁context", "▁sizes", "▁is" ,"▁comp", "ut", "ation", "ally", "▁expensive", "▁requiring", "▁extensive", "▁training", "▁hours", "▁and", "▁G", "PU", "▁resources", ".", "▁On", "▁the", "▁one", "▁hand", ",", "▁although", "▁den", "se", "▁global", "▁attention", "▁is", "▁needed", "▁during", "▁in", "fer", "ence", ",", "▁fine", "-", "tun", "ing", "▁the", "▁model", "▁can" ,"▁be", "▁effectively", "▁and", "▁efficiently", "▁done", "▁by", "▁spar", "se", "▁local", "▁attention", ".", "▁In", "▁this", "▁paper", ",", "▁we", "▁speed", "▁up", "▁the", "▁context", "▁extension", "▁of", "▁L", "LM", "s", ",", "▁in", "▁two", "▁aspects", ".", "▁Particularly", ",", "▁it", "▁can", "▁be", "▁implemented", "▁with", "▁only", "▁two", "▁lines", "▁of", "▁code", "▁in", "▁training", ",", "▁while", "▁being", "▁optional", "▁in", "▁in", "fer", "ence", ".", "▁Typical", "ly" , "▁training", "▁L", "LM", "s", "▁with", "▁long", "▁context", "▁sizes", "▁is" ,"▁comp", "ut", "ation", "ally", "▁expensive", "▁requiring", "▁extensive", "▁training", "▁hours", "▁and", "▁G", "PU", "▁resources", ".", "▁On", "▁the", "▁one", "▁hand", ",", "▁although", "▁den", "se", "▁global", "▁attention", "▁is", "▁needed", "▁during", "▁in", "fer", "ence", ",", "▁fine", "-", "tun", "ing", "▁the", "▁model", "▁can" ,"▁be", "▁effectively", "▁and", "▁efficiently", "▁done", "▁by", "▁spar", "se", "▁local", "▁attention", "."}}
Hardware
CPU: arm64 Kunpeng 920 Series @ 2.6 GHz (192 cores, all utilized)
NPU: Ascend 910A AI Processor (8 devices, 1 utilized)
Experiments
| Run (4 tokens) | CPU | CANN |
|---|---|---|
| 1 | 0:00:00.098600 | 0:00:00.093737 |
| 2 | 0:00:00.098584 | 0:00:00.092929 |
| 3 | 0:00:00.131760 | 0:00:00.093115 |
| 4 | 0:00:00.109684 | 0:00:00.093026 |

| Run (306 tokens) | CPU | CANN |
|---|---|---|
| 1 | 0:00:02.437300 | 0:00:02.283184 |
| 2 | 0:00:02.468804 | 0:00:02.018239 |
| 3 | 0:00:02.469789 | 0:00:01.877654 |
| 4 | 0:00:02.744319 | 0:00:02.080763 |
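Averaging the values above (copied verbatim from the tables) gives the overall CPU-to-CANN speedups; a small script to compute them:

# average latencies (seconds) copied verbatim from the tables above
runs_4 = {"cpu": [0.098600, 0.098584, 0.131760, 0.109684],
          "cann": [0.093737, 0.092929, 0.093115, 0.093026]}
runs_306 = {"cpu": [2.437300, 2.468804, 2.469789, 2.744319],
            "cann": [2.283184, 2.018239, 1.877654, 2.080763]}

for name, runs in (("4 tokens", runs_4), ("306 tokens", runs_306)):
    cpu = sum(runs["cpu"]) / len(runs["cpu"])
    cann = sum(runs["cann"]) / len(runs["cann"])
    print(f"{name}: cpu {cpu:.3f}s, cann {cann:.3f}s, speedup {cpu / cann:.2f}x")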
cd python && python3 setup.py bdist_wheel && cd ..

  File "/root/exit/envs/python39/lib/python3.9/site-packages/faster_whisper/transcribe.py", line 884, in restore_speech_timestamps
    for segment in segments:
  File "/root/exit/envs/python39/lib/python3.9/site-packages/faster_whisper/transcribe.py", line 396, in generate_segments
    encoder_output = self.encode(segment)
  File "/root/exit/envs/python39/lib/python3.9/site-packages/faster_whisper/transcribe.py", line 574, in encode
    return self.model.encode(features, to_cpu=True)
RuntimeError: not implemented in CANN

I have finished compiling, but I get this error during use. Is there a good way to resolve it?
For faster-whisper to work, additional tensor operators have to be implemented for the CANN backend. We have already completed this work on our side, but we have not pushed it to GitHub yet due to a change in priorities.
I have finished compiling with CANN 7.0.0.beta1. Using the example from the documentation, I ran into this problem:
>>> import ctranslate2
>>> translator=ctranslate2.Translatro("ende_ctranslate2/", device="auto")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'ctranslate2' has no attribute 'Translatro'
>>> translator=ctranslate2.Translator("ende_ctranslate2/", device="auto")
>>> results = translator.translate_batch([["H@@", "ello", "world@@", "!"]])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: CANN failed with error 100024
Does the CANN backend only support CANN 7.0.RC1.alpha001? Also, is there a good way to resolve this problem?