Holden X
Even though some neurons in the FFN have been split onto the GPU, their weights are still retained on the CPU in current implementations, which is unnecessary and occupies an...
1. For FFN layers, an unnecessary synchronization is introduced between the hybrid CPU and GPU computation of FC1 and FC2.
2. In the open-sourced code, the selective synchronization is not...
As of now, PowerInfer uses CUDA cores for sparse operator computation, which is inefficient for prompt-phase computation. To further support multi-batch serving, PowerInfer plans to...
Related issues/proposals:
- [ ] #95
- [ ] #96
- [ ] #97
PowerInfer currently optimizes for LLMs (Large Language Models) that utilize the ReLU activation function, leveraging their internal activation locality. However, many of the trending models do not use ReLU activation,...
As we embark on the initial phase of PowerInfer's development, our primary goal is to introduce the hybrid inference feature across all major desktop hardware and software platforms. Our current...
To fully harness the power of Macs, especially those with M-series chips, integrating a Metal backend is key. The core task ahead is adapting our key sparse operators, including `mul_mat_sparse` and `axpy`,...
PowerInfer encountered unexpected errors in WSL, mostly due to CUDA APIs. Related issues: #42, #46, #63. Since our WSL test bed has been set up, we can try to reproduce...
After releasing online FFN offloading, we have found new issues:
- [x] Decoding bug: #77.
- [x] Python module issue: #55, #78.
- [ ] Inaccuracy when offloading under...
Building PowerInfer on Windows in both CPU-only and hybrid inference modes requires replacing a series of POSIX API calls with equivalents supported by MSVC, including:
* Atomic operations ...