[Bug] [RISC-V RVV] cos operator shows slight performance degradation

yanyanyanggg opened this issue 2 months ago.

Description

The cosine operator shows minor performance degradation with the RISC‑V Vector (RVV) extension, achieving 0.981× the performance of the scalar implementation. While the regression is small, it still indicates room for optimization in vectorized trigonometric functions.

Steps to Reproduce

Generate the cos operator with the following configuration:

params = {
    "op_name": "cos",  # read by export_cos via params["op_name"]; value assumed
    "dtype": "float32",
    "batch": 14,
    "channels": 23,
    "input_height": 67,
    "input_width": 99
}
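The tensor size follows directly from the shape above. The quick check below assumes only the `params` values shown. The `numel` and `nbytes` names are illustrative and are not part of the reporter's harness.

```python
import math

params = {
    "dtype": "float32",
    "batch": 14,
    "channels": 23,
    "input_height": 67,
    "input_width": 99,
}

# NCHW shape used by the cos operator
shape = (params["batch"], params["channels"],
         params["input_height"], params["input_width"])

numel = math.prod(shape)   # total number of elements
nbytes = numel * 4         # float32 uses 4 bytes per element

print(numel)
print(nbytes)
```

Each run therefore reads about 8.5 MB of float32 input and writes the same amount of output.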

Export the operator to two targets:

RV target (scalar, without vector extension):

llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c

RVV target (with vector extension):

llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c,+v
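The two target strings differ only in the `+v` entry of `-mattr`. The minimal sketch below derives both strings from one base, which keeps the two builds from drifting apart by accident. The `make_target` helper is hypothetical.

```python
# Scalar feature set shared by both builds
BASE_ATTRS = ["+64bit", "+m", "+a", "+f", "+d", "+c"]

def make_target(vector: bool) -> str:
    """Build the LLVM target string for the RV (scalar) or RVV build."""
    attrs = BASE_ATTRS + (["+v"] if vector else [])
    return ("llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 "
            "-mabi=lp64d -mattr=" + ",".join(attrs))

rv_target = make_target(vector=False)
rvv_target = make_target(vector=True)
print(rv_target)
print(rvv_target)
```

You can pass either string wherever the export script expects a target, for example to `tvm.target.Target`.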

Run performance measurement on both targets.
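The issue does not say how the timings were collected. The gap between the two builds is only about 2%, so a repeated measurement that reports the median is less sensitive to one-off noise than a single run. The generic sketch below uses `run_once` as a hypothetical placeholder for whatever invokes the compiled operator. TVM's runtime modules also provide `time_evaluator` for the same purpose.

```python
import statistics
import time
from typing import Callable

def median_ms(run_once: Callable[[], None], warmup: int = 3, repeat: int = 20) -> float:
    """Return the median wall-clock time of run_once() in milliseconds."""
    for _ in range(warmup):          # untimed runs to warm caches
        run_once()
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

# Stand-in workload; replace with the call into the compiled cos module
t = median_ms(lambda: sum(range(100_000)))
print(f"{t:.3f} ms")
```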

Operator definition code:

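from tvm import relay

def export_cos(params, set_dir=None, platform="rv"):
    # NCHW input tensor described by params (platform is unused as shown)
    data = relay.var("data",
                     shape=(params["batch"], params["channels"],
                            params["input_height"], params["input_width"]),
                     dtype=params["dtype"])
    # Elementwise cosine over the whole tensor
    cos_op = relay.cos(data)
    # export_op is the reporter's export helper (not shown in this issue)
    export_op(cos_op, params["op_name"], [data], params, set_dir=set_dir)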

Performance Data

RV execution time: 15.894500 ms
RVV execution time: 16.210500 ms
Acceleration ratio (RV/RVV): 0.981 (RVV is ~1.02× slower)
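The ratio and the slowdown follow directly from the two reported timings:

```python
rv_ms = 15.8945    # scalar build (RV)
rvv_ms = 16.2105   # vector build (RVV)

speedup = rv_ms / rvv_ms                    # below 1.0 means RVV is slower
slowdown = rvv_ms / rv_ms                   # how many times slower RVV is
extra_pct = (rvv_ms - rv_ms) / rv_ms * 100  # extra time as a percentage

print(f"speedup  = {speedup:.3f}")
print(f"slowdown = {slowdown:.3f}")
print(f"extra    = {extra_pct:.1f}%")
```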

Environment Information

TVM version: 0.19.0
LLVM version: [Please provide: llvm-config --version]
Hardware: Spacemit K1‑X bit‑brick board
CPU: Spacemit X60 (8 cores, 1.6 GHz)
ISA: rv64imafdcv (with vector extensions)
Memory: 7.6 GB
OS: Bianbu 2.2, Linux kernel 6.6.63
Operation: Elementwise cosine on ~2.1M elements (14 × 23 × 67 × 99 = 2,135,826)

Expected Behavior

RVV vectorization should provide a performance improvement over the scalar RV baseline for trigonometric functions like cosine.

Additional Context

The cos operation is applied elementwise to a tensor of ~2.1M elements (2,135,826).
This regression is smaller than those seen for the other operators, but it still shows that vectorization is not delivering the expected speedup. The current RVV code generation may therefore be suboptimal even for compute-intensive operations, which are where vectorization would normally be expected to pay off most.
This issue is part of a broader pattern where all tested operators (including sum, log, relu, bias_add, sqrt, floor, round, avg_pool2d, sigmoid, softmax, negative, max_pool2d, and cos) show performance degradation with RVV, indicating a potential systemic issue in TVM's RVV code generation or optimization.

Dec 09 '25 04:12 yanyanyanggg