joshpopelka20
Is it possible to run this in the cloud?
I'm using this code to run inference: ``` # Use a pipeline as a high-level helper from transformers import pipeline # Load model directly from transformers import AutoTokenizer, AutoModelForTokenClassification tokenizer...
I'm working with a long-context model (gradientai/Llama-3-8B-Instruct-262k) that exceeds the memory of a single A100 GPU. The model weights load, but when I try to run inference, I...
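For a model that doesn't fit on one GPU, the usual first step is loading with `device_map="auto"` (e.g. `AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)`), which lets Accelerate shard layers across the available devices. Conceptually, that placement assigns consecutive layers to devices greedily by memory budget. Here's a toy sketch of that idea; the function and layer names are hypothetical, not the Accelerate API:

```python
def build_device_map(layer_sizes_gb, device_capacity_gb):
    """Toy greedy layer-to-device assignment, mimicking the idea
    behind device_map="auto": fill one device, then move to the next."""
    device_map, dev, used = {}, 0, 0.0
    for name, size in layer_sizes_gb:
        # spill to the next device once the current one is full
        if used + size > device_capacity_gb and used > 0:
            dev += 1
            used = 0.0
        device_map[name] = dev
        used += size
    return device_map


# hypothetical per-layer sizes for illustration only
layers = [("layer.0", 3.0), ("layer.1", 3.0), ("layer.2", 3.0)]
print(build_device_map(layers, device_capacity_gb=7.0))
```

Note that `device_map="auto"` only shards the weights; activations (and especially the KV cache at 262k context) still have to fit, which is where it usually fails for long-context inference.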
This is the start of the RingAttention code. The changes so far create multiple KV caches (one per device when num_devices > 1) and attempt to split the sequence into separate chunks.
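The core idea in RingAttention is that each device holds one KV chunk, and attention is accumulated chunk by chunk with an online (running) softmax, so no device ever materializes the full attention matrix. A minimal single-process NumPy sketch of that accumulation (function names are mine; the real implementation overlaps the ring communication of KV chunks with compute):

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: ordinary softmax attention over the full KV."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ v

def ring_attention(q, k, v, num_chunks):
    """Accumulate attention over KV chunks with an online softmax,
    as each 'device' in the ring would do with its local chunk."""
    d = q.shape[-1]
    m = np.full(q.shape[0], -np.inf)  # running row-wise max of scores
    l = np.zeros(q.shape[0])          # running softmax denominator
    out = np.zeros_like(q)            # running weighted sum of V
    for kc, vc in zip(np.array_split(k, num_chunks),
                      np.array_split(v, num_chunks)):
        s = q @ kc.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])
        scale = np.exp(m - m_new)     # rescale old accumulators
        l = l * scale + p.sum(axis=-1)
        out = out * scale[:, None] + p @ vc
        m = m_new
    return out / l[:, None]
```

The chunked result matches full attention exactly (up to float rounding), which is what makes it safe to shard the 262k-token KV cache across devices.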
I'll work through adding it to quantized Llama first, since that's the architecture I know best. Link to the paper: https://arxiv.org/abs/2310.01889
I'm trying to use Llama 3.1 70B to do a "multi-needle in a haystack" search. Basically, I'm asking the model to take a text and search it for a list of terms;...
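A multi-needle test like this usually has two parts: planting the needles at random positions in the filler text, and scoring how many the model's answer recovers. A small harness sketch (helper names and the needle sentence template are my own, not from any benchmark library):

```python
import random

def build_haystack(filler_paragraphs, needles, seed=0):
    """Insert one needle sentence per term at a random position
    among the filler paragraphs."""
    rng = random.Random(seed)
    paras = list(filler_paragraphs)
    for term in needles:
        pos = rng.randrange(len(paras) + 1)
        paras.insert(pos, f"The secret term is: {term}.")
    return "\n\n".join(paras)

def score_answer(answer, needles):
    """Fraction of planted terms the model's answer mentions."""
    found = [t for t in needles if t.lower() in answer.lower()]
    return len(found) / len(needles)
```

Scoring by substring match is crude but cheap; running the same needles at several insertion depths and context lengths is what reveals where long-context retrieval starts to degrade.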