Holden X
Even though some neurons in the FFN have been split onto the GPU, their weights are still retained on the CPU in current implementations, which is unnecessary and occupies an...
1. For FFN layers, an unnecessary synchronization is introduced between the hybrid CPU and GPU computation of FC1 and FC2.
2. In the open-sourced code, the selective synchronization is not...
As of now, PowerInfer uses CUDA cores for sparse operator computation, which is inefficient for prompt-phase computation. To further support multi-batch serving, PowerInfer plans to...
Related issues/proposals:
- [ ] #95
- [ ] #96
- [ ] #97
PowerInfer currently optimizes for LLMs (Large Language Models) that utilize the ReLU activation function, leveraging their internal activation locality. However, many of the trending models do not use ReLU activation,...
As we embark on the initial phase of PowerInfer's development, our primary goal is to introduce the hybrid inference feature across all major desktop hardware and software platforms. Our current...
To fully harness the power of Macs, especially those with M-series chips, integrating a Metal backend is key. The core task ahead is adapting our key sparse operators, including `mul_mat_sparse` and `axpy`,...
PowerInfer encountered unexpected errors in WSL, mostly due to CUDA APIs. Related issues: #42, #46, #63. Since our WSL test bed has been set up, we can try to reproduce...
After releasing online FFN offloading, we have found new issues:
- [x] Decoding bug: #77.
- [x] Python module issue: #55, #78.
- [ ] Inaccuracy when offloading under...
Building PowerInfer on Windows in both CPU-only and hybrid inference modes requires replacing a series of POSIX API calls with equivalents supported by MSVC, including:
* Atomic operations ...