Dampfinchen

9 issues by Dampfinchen

Hello, is it possible to run the 117M GPT-2 model with 6 GB VRAM using FP16?
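
A minimal sketch of what this would look like, assuming the Hugging Face transformers API rather than whatever stack the issue targets: at 117M parameters, the FP16 weights are only about 117M × 2 bytes ≈ 0.23 GB, so 6 GB of VRAM is ample even with activations and KV cache on top.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 117M parameters in FP16 is ~0.23 GB of weights; 6 GB of VRAM is plenty.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "gpt2", torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```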

Hello, I've noticed memory management with Oobabooga is quite poor compared to KoboldAI and Tavern. Here are some tests I've done: KoboldAI + Tavern: Running Pygmalion 6B with 6...

enhancement

Hey there. Please take a look at this code: https://github.com/AlpinDale/gptq-gptj Could you add 4-bit quantization support for GPT-J? If this works out, it would allow Pygmalion 6B to load in...

enhancement
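
To illustrate the memory win the issue is after, here is a hedged sketch using transformers' bitsandbytes 4-bit loading, a different route than the GPTQ code linked above: GPT-J-6B needs roughly 12 GB of weights in FP16, versus about 3.5 GB at 4 bits, which is what would let Pygmalion 6B (a GPT-J finetune) fit in 6 GB.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 loading via bitsandbytes (illustrative; the issue proposes
# GPTQ instead). 6B params: ~12 GB in FP16 -> ~3.5 GB at 4-bit.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    quantization_config=bnb_config,
    device_map="auto",
)
```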

Hello. INT4 is accelerated on the tensor cores of the Ada Lovelace, Ampere, and Turing GPU architectures and can effectively halve VRAM requirements compared to INT8 (and that halves memory consumption in comparison...

feature request
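
The halving claim is just weight storage scaling linearly with bit width; a quick back-of-the-envelope check for a 6B-parameter model:

```python
# Weight memory scales linearly with bits per parameter, so INT4
# halves INT8, which in turn halves FP16.
def weight_gib(params: float, bits: int) -> float:
    return params * bits / 8 / 1024**3

for bits in (16, 8, 4):
    print(f"6B params @ {bits}-bit: {weight_gib(6e9, bits):.1f} GiB")
# -> 11.2 GiB (FP16), 5.6 GiB (INT8), 2.8 GiB (INT4)
```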

What are you trying to do? Why is the GGUF converted instead of just being run directly, as all the other inference engines do (Llama.cpp, Koboldcpp, Oobabooga, LM-Studio, etc.)? ...

needs-triage

By default, Exllama V2 uses a batch size of 2048 for prompt processing, which adds a ton of VRAM usage. On TabbyAPI and ExGUI it is possible to set the prompt...

enhancement
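
A hedged sketch of what this could look like with the exllamav2 Python library; to my understanding `max_input_len` is the knob that TabbyAPI's chunk-size setting maps to, but treat the exact attribute names as assumptions and verify against the version you run:

```python
# Hedged sketch (attribute names are assumptions based on exllamav2's
# config object). Lowering the prefill chunk from the default 2048
# trades prompt-processing speed for less temporary VRAM.
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config

config = ExLlamaV2Config("/path/to/model")  # hypothetical model dir
config.max_input_len = 512                  # prefill chunk size (default 2048)
config.max_attention_size = 512 ** 2        # keep attention buffer in step

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
```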

Why is Ampere or Ada (RTX 3000 and RTX 4000 series) required to support this? Turing (RTX 2000 series) has INT4 tensor cores.

Hello, could we please have 13B and 7B models with the updated architecture that includes grouped-query attention? A lot of people are running these models on machines with low...

new-feature
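
For context on why GQA matters at low VRAM: the KV cache scales with the number of KV heads, so sharing them across query heads shrinks the cache proportionally. A rough sizing with illustrative Llama-2-7B-like shapes (my arithmetic, not from the issue):

```python
# KV-cache size scales with the number of KV heads; GQA shares KV heads
# across groups of query heads. FP16 cache, illustrative shapes.
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    # factor of 2 for keys and values
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per / 1024**3

full_mha = kv_cache_gib(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
gqa_8kv  = kv_cache_gib(layers=32, kv_heads=8,  head_dim=128, seq_len=4096)
print(f"MHA (32 KV heads): {full_mha:.2f} GiB")  # ~2.00 GiB
print(f"GQA (8 KV heads):  {gqa_8kv:.2f} GiB")   # ~0.50 GiB
```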

I am running the FP16 version of Flux and the FP16 T5 text encoder on my RTX 2060 laptop with 32 GB RAM. I was surprised to see WebUI Forge...