CANN Backend support
Introduction
CANN (Compute Architecture of Neural Networks), developed by Huawei, is a heterogeneous computing architecture for AI scenarios.
It provides multi-layer programming interfaces to help users quickly build AI applications and services based on the Ascend platform.
The CANN backend in CTranslate2 enables running AI models on Ascend NPUs, extending the existing CPU and CUDA workflows. More on the Ascend NPU and the CANN library can be found here.
Examples of projects that already support CANN include ONNX Runtime and OpenCV.
resolves #1609
Notes
- While developing this feature, we also submitted issue https://github.com/OpenNMT/CTranslate2/issues/1583 .
- If CANN backend support gains traction, a follow-up pull request/subproject will contribute the corresponding CI running on dedicated Ascend hardware.
Implementation
The CANN backend introduces Device::CANN, analogous to the existing CPU and CUDA devices.
The CANN workflow is enabled with -DWITH_CANN=ON in the CMake configuration (see examples/cann). As with CUDA, CANN can coexist alongside the CPU workflow.
The CANN workflow is accessible through the examples (examples/cann/main.cc), the CLI, or the Python module.
Operators and primitives were implemented for CANN so that the end-to-end example in the CTranslate2 documentation runs successfully.
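As a quick illustration of the new device from the Python side, here is a minimal sketch; it assumes the device string "cann" is accepted by analogy with "cpu" and "cuda", while device="auto" (used in the samples below) picks the NPU automatically when available.

import ctranslate2

# Assumption: "cann" is accepted as a device string, by analogy with "cpu" and "cuda";
# device="auto" (used in the samples below) selects the NPU automatically when available.
translator = ctranslate2.Translator("ende_ctranslate2/", device="cann", device_index=0)
results = translator.translate_batch([["▁H", "ello", "▁world", "!"]])
print(results[0].hypotheses[0])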
Tests
Tests were extended to cover Device::CANN and the respective DataType. Additional tests were implemented for extra/edge cases. Gtest output: gtest_cann.log
Environment Setup
- Download the CANN drivers by selecting the AArch64.run category (the current implementation was developed against CANN 7.0.RC1.alpha001).
- Build the image and run the container as in docker/cann.
For details on setting up the development and operating environments, see Development and Operating Environment Setup and the CANN Software Installation Guide.
Build CANN Python module
The CANN Python module is expected to be built using the respective Docker files. Nevertheless, here is a quick way to build it, convenient for testing and benchmarking.
#!/bin/bash
# execute from project root
rm -rf build-release/
mkdir build-release && cd build-release || exit
cmake -DWITH_CANN=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_CLI=OFF -DWITH_MKL=OFF -DOPENMP_RUNTIME=COMP -DCMAKE_PREFIX_PATH="/opt/OpenBLAS" -DWITH_OPENBLAS=ON -DWITH_RUY=ON ..
VERBOSE=1 make -j"$(nproc)" install && cd ..
export CIBW_ARCHS=aarch64
pip3 uninstall --yes ctranslate2
pip3 install -r python/install_requirements.txt
cd python && python3 setup.py bdist_wheel && cd ..
python3 -m pip install python/dist/ctranslate2*.whl
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
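After installing the wheel, a quick sanity check (a minimal sketch, assuming the CANN runtime environment from docker/cann is active) is to query the backend from Python:

# quick sanity check after installing the wheel
import ctranslate2

# compute types available on the NPU (cf. the sample output below)
print(ctranslate2.get_supported_compute_types("cann"))
# number of visible Ascend devices
print(ctranslate2.get_cann_device_count())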
Build CANN C++ example
#!/bin/bash
# execute from project root
# first build ct2lib
rm -rf build-release/
mkdir build-release && cd build-release || exit
cmake -DWITH_CANN=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_CLI=OFF -DWITH_MKL=OFF -DOPENMP_RUNTIME=COMP -DCMAKE_PREFIX_PATH="/opt/OpenBLAS" -DWITH_OPENBLAS=ON -DWITH_RUY=ON ..
make -j"$(nproc)"
rm CMakeCache.txt
# then build cann_run
cmake -DCMAKE_BUILD_TYPE=Release ../examples/cann/
make -j"$(nproc)"
# ./cann_run <ende_ctranslate2_path>
Samples
Python
import ctranslate2
print("get_supported_compute_types for cann: ", ctranslate2.get_supported_compute_types("cann"))
print("get_cann_device_count: ", ctranslate2.get_cann_device_count())
translator = ctranslate2.Translator("/ctranslate2_docs/ende_ctranslate2/", device="auto")
results = translator.translate_batch([["▁H", "ello", "▁world", "!"]])
output_tokens = results[0].hypotheses[0]
print(output_tokens)
> python3 ct2python_example.py
get_supported_compute_types for cann: {'int8_float16', 'int8_float32', 'int8', 'float32', 'bfloat16', 'int8_bfloat16', 'float16'}
get_cann_device_count: 8
['▁Hallo', '▁Welt', '!']
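Optionally, one of the compute types reported above can be requested explicitly. The sketch below relies on the standard compute_type argument of ctranslate2.Translator and picks float16, which appears in the supported list above:

import ctranslate2

# request float16 explicitly; it is listed by get_supported_compute_types("cann") above
translator = ctranslate2.Translator(
    "/ctranslate2_docs/ende_ctranslate2/",
    device="auto",
    compute_type="float16",
)
print(translator.translate_batch([["▁H", "ello", "▁world", "!"]])[0].hypotheses[0])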
C++
An execution example in C++ can be found in examples/cann.
CLI
echo "▁H ello ▁world !" | ./ct2-translator --model "./ende_ctranslate2/"
▁Hallo ▁Welt !
Benchmark
We conducted several runs measuring translation latency using all 192 CPU cores and 1 NPU device for a single batch. Specifically, the experiments report 4 consecutive runs for inputs of 4 and 306 tokens respectively. The NPU was faster in every case.
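For reference, the per-run latencies were collected with a simple timing loop; the sketch below is an illustrative approximation (it assumes the model path from the samples above and does not reproduce the exact CPU thread settings), and printing the datetime difference yields the h:mm:ss.microseconds values shown in the tables:

import datetime
import ctranslate2

batch_4 = [["▁H", "ello", "▁world", "!"]]  # the 4-token input shown below

for device in ("cpu", "auto"):  # "auto" selects the NPU when the module is built with CANN
    translator = ctranslate2.Translator("/ctranslate2_docs/ende_ctranslate2/", device=device)
    for run in range(1, 5):
        start = datetime.datetime.now()
        translator.translate_batch(batch_4)
        print(device, run, datetime.datetime.now() - start)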
Input tokens
4 tokens
{{"▁H", "ello", "▁world", "!"}}
306 tokens
{{"▁In", "▁this", "▁paper", ",", "▁we", "▁speed", "▁up", "▁the", "▁context", "▁extension", "▁of", "▁L", "LM", "s", ",", "▁in", "▁two", "▁aspects", ".", "▁Particularly", ",", "▁it", "▁can", "▁be", "▁implemented", "▁with", "▁only", "▁two", "▁lines", "▁of", "▁code", "▁in", "▁training", ",", "▁while", "▁being", "▁optional", "▁in", "▁in", "fer", "ence", ".", "▁Typical", "ly" , "▁training", "▁L", "LM", "s", "▁with", "▁long", "▁context", "▁sizes", "▁is" ,"▁comp", "ut", "ation", "ally", "▁expensive", "▁requiring", "▁extensive", "▁training", "▁hours", "▁and", "▁G", "PU", "▁resources", ".", "▁On", "▁the", "▁one", "▁hand", ",", "▁although", "▁den", "se", "▁global", "▁attention", "▁is", "▁needed", "▁during", "▁in", "fer", "ence", ",", "▁fine", "-", "tun", "ing", "▁the", "▁model", "▁can" ,"▁be", "▁effectively", "▁and", "▁efficiently", "▁done", "▁by", "▁spar", "se", "▁local", "▁attention", ".", "▁In", "▁this", "▁paper", ",", "▁we", "▁speed", "▁up", "▁the", "▁context", "▁extension", "▁of", "▁L", "LM", "s", ",", "▁in", "▁two", "▁aspects", ".", "▁Particularly", ",", "▁it", "▁can", "▁be", "▁implemented", "▁with", "▁only", "▁two", "▁lines", "▁of", "▁code", "▁in", "▁training", ",", "▁while", "▁being", "▁optional", "▁in", "▁in", "fer", "ence", ".", "▁Typical", "ly" , "▁training", "▁L", "LM", "s", "▁with", "▁long", "▁context", "▁sizes", "▁is" ,"▁comp", "ut", "ation", "ally", "▁expensive", "▁requiring", "▁extensive", "▁training", "▁hours", "▁and", "▁G", "PU", "▁resources", ".", "▁On", "▁the", "▁one", "▁hand", ",", "▁although", "▁den", "se", "▁global", "▁attention", "▁is", "▁needed", "▁during", "▁in", "fer", "ence", ",", "▁fine", "-", "tun", "ing", "▁the", "▁model", "▁can" ,"▁be", "▁effectively", "▁and", "▁efficiently", "▁done", "▁by", "▁spar", "se", "▁local", "▁attention", ".", "▁In", "▁this", "▁paper", ",", "▁we", "▁speed", "▁up", "▁the", "▁context", "▁extension", "▁of", "▁L", "LM", "s", ",", "▁in", "▁two", "▁aspects", ".", "▁Particularly", ",", "▁it", "▁can", "▁be", "▁implemented", "▁with", "▁only", "▁two", "▁lines", "▁of", "▁code", "▁in", "▁training", ",", "▁while", "▁being", "▁optional", "▁in", "▁in", "fer", "ence", ".", "▁Typical", "ly" , "▁training", "▁L", "LM", "s", "▁with", "▁long", "▁context", "▁sizes", "▁is" ,"▁comp", "ut", "ation", "ally", "▁expensive", "▁requiring", "▁extensive", "▁training", "▁hours", "▁and", "▁G", "PU", "▁resources", ".", "▁On", "▁the", "▁one", "▁hand", ",", "▁although", "▁den", "se", "▁global", "▁attention", "▁is", "▁needed", "▁during", "▁in", "fer", "ence", ",", "▁fine", "-", "tun", "ing", "▁the", "▁model", "▁can" ,"▁be", "▁effectively", "▁and", "▁efficiently", "▁done", "▁by", "▁spar", "se", "▁local", "▁attention", "."}}
Hardware
CPU: arm64 Kunpeng 920 Series @ 2.6 GHz (192 cores, all utilized)
NPU: Ascend 910A AI Processor (8 devices, 1 utilized)
Experiments
| Run (4 tokens) | CPU | CANN |
|---|---|---|
| 1 | 0:00:00.098600 | 0:00:00.093737 |
| 2 | 0:00:00.098584 | 0:00:00.092929 |
| 3 | 0:00:00.131760 | 0:00:00.093115 |
| 4 | 0:00:00.109684 | 0:00:00.093026 |

| Run (306 tokens) | CPU | CANN |
|---|---|---|
| 1 | 0:00:02.437300 | 0:00:02.283184 |
| 2 | 0:00:02.468804 | 0:00:02.018239 |
| 3 | 0:00:02.469789 | 0:00:01.877654 |
| 4 | 0:00:02.744319 | 0:00:02.080763 |
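Averaging the values above (copied verbatim from the tables) gives the overall CPU-to-CANN speedups; a small script to compute them:

# average latencies (seconds) copied verbatim from the tables above
runs_4 = {"cpu": [0.098600, 0.098584, 0.131760, 0.109684],
          "cann": [0.093737, 0.092929, 0.093115, 0.093026]}
runs_306 = {"cpu": [2.437300, 2.468804, 2.469789, 2.744319],
            "cann": [2.283184, 2.018239, 1.877654, 2.080763]}

for name, runs in (("4 tokens", runs_4), ("306 tokens", runs_306)):
    cpu = sum(runs["cpu"]) / len(runs["cpu"])
    cann = sum(runs["cann"]) / len(runs["cann"])
    print(f"{name}: cpu {cpu:.3f}s, cann {cann:.3f}s, speedup {cpu / cann:.2f}x")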
cd python && python3 setup.py bdist_wheel && cd ..

  File "/root/exit/envs/python39/lib/python3.9/site-packages/faster_whisper/transcribe.py", line 884, in restore_speech_timestamps
    for segment in segments:
  File "/root/exit/envs/python39/lib/python3.9/site-packages/faster_whisper/transcribe.py", line 396, in generate_segments
    encoder_output = self.encode(segment)
  File "/root/exit/envs/python39/lib/python3.9/site-packages/faster_whisper/transcribe.py", line 574, in encode
    return self.model.encode(features, to_cpu=True)
RuntimeError: not implemented in CANN

I have finished compiling, but I get this error during use. Is there a good way to resolve it?
For faster-whisper to work, additional tensor operators have to be implemented for the CANN backend. We have already completed this work on our side, but we have not pushed it to GitHub yet due to a change in priorities.
I have finished compiling with CANN 7.0.0.beta1. Using the example from the documentation, I ran into this problem:
>>> import ctranslate2
>>> translator=ctranslate2.Translatro("ende_ctranslate2/", device="auto")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'ctranslate2' has no attribute 'Translatro'
>>> translator=ctranslate2.Translator("ende_ctranslate2/", device="auto")
>>> results = translator.translate_batch([["H@@", "ello", "world@@", "!"]])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: CANN failed with error 100024
Does the CANN backend only support CANN 7.0.RC1.alpha001? Also, is there a good way to resolve this problem?