Dampfinchen

9 issues by Dampfinchen

Hello, is it possible to run the 117M GPT-2 model with 6 GB VRAM using FP16?
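
A minimal sketch of what this would look like, assuming the Hugging Face transformers API rather than whatever stack the issue targets: at 117M parameters, the FP16 weights are only about 117M × 2 bytes ≈ 0.23 GB, so 6 GB of VRAM is ample even with activations and KV cache on top.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 117M parameters in FP16 is ~0.23 GB of weights; 6 GB of VRAM is plenty.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "gpt2", torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```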

Hello, I've noticed memory management with Oobabooga is quite poor compared to KoboldAI and Tavern. Here are some tests I've done: KoboldAI + Tavern: Running Pygmalion 6B with 6...

enhancement

Hey there. Please take a look at this code: https://github.com/AlpinDale/gptq-gptj Could you add 4-bit quantization support for GPT-J? If this works out, it would allow Pygmalion 6B to load in...

enhancement
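
To illustrate the memory win the issue is after, here is a hedged sketch using transformers' bitsandbytes 4-bit loading, a different route than the GPTQ code linked above: GPT-J-6B needs roughly 12 GB of weights in FP16, versus about 3.5 GB at 4 bits, which is what would let Pygmalion 6B (a GPT-J finetune) fit in 6 GB.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 loading via bitsandbytes (illustrative; the issue proposes
# GPTQ instead). 6B params: ~12 GB in FP16 -> ~3.5 GB at 4-bit.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    quantization_config=bnb_config,
    device_map="auto",
)
```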

Hello. INT4 is accelerated on the tensor cores of the Ada Lovelace, Ampere, and Turing GPU architectures and can effectively halve VRAM requirements compared to INT8 (and that halves memory consumption in comparison...

feature request
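
The halving claim is just weight storage scaling linearly with bit width; a quick back-of-the-envelope check for a 6B-parameter model:

```python
# Weight memory scales linearly with bits per parameter, so INT4
# halves INT8, which in turn halves FP16.
def weight_gib(params: float, bits: int) -> float:
    return params * bits / 8 / 1024**3

for bits in (16, 8, 4):
    print(f"6B params @ {bits}-bit: {weight_gib(6e9, bits):.1f} GiB")
# -> 11.2 GiB (FP16), 5.6 GiB (INT8), 2.8 GiB (INT4)
```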

What are you trying to do? Why is the GGUF converted instead of just being run directly, as all the other inference engines do (Llama.cpp, Koboldcpp, Oobabooga, LM-Studio, etc.)? ...

needs-triage

By default, Exllama V2 uses a batch size of 2048 for prompt processing, which adds a ton of VRAM usage. On TabbyAPI and ExGUI it is possible to set the prompt...

enhancement
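
A hedged sketch of what this could look like with the exllamav2 Python library; to my understanding `max_input_len` is the knob that TabbyAPI's chunk-size setting maps to, but treat the exact attribute names as assumptions and verify against the version you run:

```python
# Hedged sketch (attribute names are assumptions based on exllamav2's
# config object). Lowering the prefill chunk from the default 2048
# trades prompt-processing speed for less temporary VRAM.
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config

config = ExLlamaV2Config("/path/to/model")  # hypothetical model dir
config.max_input_len = 512                  # prefill chunk size (default 2048)
config.max_attention_size = 512 ** 2        # keep attention buffer in step

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
```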

Why is Ampere or Ada (RTX 3000 and RTX 4000 series) required to support this? Turing (RTX 2000 series) has INT4 tensor cores.

Hello, could we please have 13B and 7B models with the updated architecture that includes grouped-query attention? A lot of people are running these models on machines with low...

new-feature
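
For context on why GQA matters at low VRAM: the KV cache scales with the number of KV heads, so sharing them across query heads shrinks the cache proportionally. A rough sizing with illustrative Llama-2-7B-like shapes (my arithmetic, not from the issue):

```python
# KV-cache size scales with the number of KV heads; GQA shares KV heads
# across groups of query heads. FP16 cache, illustrative shapes.
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    # factor of 2 for keys and values
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per / 1024**3

full_mha = kv_cache_gib(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
gqa_8kv  = kv_cache_gib(layers=32, kv_heads=8,  head_dim=128, seq_len=4096)
print(f"MHA (32 KV heads): {full_mha:.2f} GiB")  # ~2.00 GiB
print(f"GQA (8 KV heads):  {gqa_8kv:.2f} GiB")   # ~0.50 GiB
```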

I am running the FP16 version of Flux and the FP16 T5 text encoder on my RTX 2060 laptop with 32 GB RAM. I was surprised to see WebUI Forge...