sorasoras issues

Results: 23 issues of sorasoras

Can ARMv8 NDK build support be added?

I noticed a modified SS-android here that can be built with the NDK for armv8: https://github.com/wongsyrone/shadowsocks-android/releases I am wondering whether SSR-android could also support ARMv8 in the same way?

UDP2RAW's fixgro conflicts with UDPspeeder

@wangyu- It seems the new UDP2RAW fix GRO option conflicts with UDPspeeder, so the two cannot be used at the same time

Support for QWEN and Baichuan2 models

### Feature request Recently, https://github.com/ggerganov/llama.cpp added support for both QWEN and Baichuan2. QWEN support was added at 1610. https://github.com/ggerganov/llama.cpp/pull/4281 I have looked at the Nomic Vulkan Fork of LLaMa.cpp,...

backend

models

Support for Qwen model

Are there any plans to support the Qwen model in the future? https://huggingface.co/Qwen It would be great to be able to merge multilingual models like Qwen that come in sizes from...

Qwen 1.5 Beta 1.8B produces incoherent output

The latest llama.cpp produces incoherent output compared to Transformers. Transformers and vLLM work fine, but the llama.cpp GGUF does not.

bug-unconfirmed

Orion14b conversion issue

Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that...

bug-unconfirmed

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

# Feature Description with KV cache quantized in 2 bits. This brings 2.6× less peak memory on the Llama/Mistral/Falcon models we evaluated while enabling 4× larger batch size, resulting in 2.35×...

enhancement

stale

BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences

> The experimental results under different lengths demonstrate that BurstAttention offers significant advantages for processing long sequences compared with these competitive baselines, especially tensor parallelism (Megatron-V3) with FlashAttention, reducing 40%...

feature request

Qdora: a scalable and memory-efficient method to close the gap between parameter efficient finetuning and full finetuning.

https://www.answer.ai/posts/2024-04-26-fsdp-qdora-llama3.html That looks awesome!

ThunderKittens: a simple yet faster FlashAttention alternative

ThunderKittens is an embedded domain-specific language (DSL) within CUDA designed to simplify the development of high-performance AI kernels on GPUs. It provides abstractions for working with small tiles (e.g., 16x16)...