
Prefill speed is approximately 4~6 tokens/s for Qwen1.5-1.8B

Open mengllm opened this issue 1 year ago • 5 comments

Hi, mllm-qnn works on my device, an OPPO Find X7 Ultra (Snapdragon 8 Gen 3, 16 GB RAM). However, the prefill speed for Qwen1.5-1.8B is approximately 4-6 tokens per second, which diverges significantly from the 1000 tokens per second claimed in the paper. Based on our tests, npuExe.run takes approximately 15 seconds to process 64 tokens:

        auto startTime = currentMs();

        // 1: Prefill stage using NPU chunk execute
        // (surrounding loop omitted from this excerpt)
        npuExe.run(npu_ctx, &npuNet, {input});
        auto result = npuExe.result();

        int duration = (int) (currentMs() - startTime);
        std::cout << "input_tensor.sequence()=" << input_tensor.sequence() << std::endl;
        std::cout << "prefill cost: " << duration << "ms prefill speed: "
                  << input_tensor.sequence() * 1000 / duration << " token/s" << std::endl;

Could you provide some suggestions?

mengllm · Aug 14 '24 03:08

In the npuExe.run function, the QNN graph is built and then executed, and the graph-building stage takes most of the time. The loading and construction time, as well as the computing speed, are still being improved.

oreomaker · Aug 14 '24 03:08

I fully understand the situation; however, I am curious how the 1000 tokens per second test result was obtained.

mengllm · Aug 14 '24 03:08

@oreomaker Could you share your updated test results for the prefill stage?

mengllm · Aug 16 '24 02:08

The currently released code is a very preliminary version of our NPU support (as noted in the README). Many of the techniques in our paper are not integrated yet, and there are a few performance issues that need more engineering effort to fix. We are still working to deliver the promised prefill speed, so please stay tuned.

oreomaker · Aug 16 '24 03:08

It's great work [1000 t/s prefill speed] on the Hexagon NPU, but is there a roadmap indicating when the prefill speed mentioned in the paper will be available?

liangzelang · Aug 24 '24 08:08