joshpopelka20

Results: 6 issues by joshpopelka20

Is it possible to run this in the cloud?

I'm using this code to run inference (note the comments use Python's `#` syntax):

```
# Use a pipeline as a high-level helper
from transformers import pipeline

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer...
```

I'm working with a long-context model (gradientai/Llama-3-8B-Instruct-262k) that exceeds the memory of a single A100 GPU. The model weights load successfully, but when I try to run inference, I...

new feature
backend
models

This is the start of the RingAttention code. The changes so far create multiple KV caches (one per device when `num_devices > 1`) and attempt to split the sequence into separate chunks.

I'll work through adding it to quantized Llama first, as that's the architecture I know best. Link to the paper: https://arxiv.org/abs/2310.01889
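The partitioning idea above can be sketched in plain Python. This is only an illustration of the RingAttention data movement, not the project's actual implementation; the helper names (`split_kv_for_devices`, `ring_step`) are made up for this sketch. Each device holds one contiguous chunk of the KV cache, and chunks rotate around the device ring so every device eventually attends over every KV block.

```python
# Hypothetical sketch of RingAttention-style KV partitioning.
# Helper names are illustrative, not the project's real API.

def split_kv_for_devices(kv_cache, num_devices):
    """Split a sequence of KV entries into num_devices contiguous chunks.

    Earlier chunks absorb the remainder, so chunk sizes differ by at most 1.
    """
    base, rem = divmod(len(kv_cache), num_devices)
    chunks, start = [], 0
    for d in range(num_devices):
        size = base + (1 if d < rem else 0)
        chunks.append(kv_cache[start:start + size])
        start += size
    return chunks

def ring_step(chunks):
    """One ring step: each device passes its KV chunk to the next device."""
    return [chunks[-1]] + chunks[:-1]
```

After `num_devices` ring steps, every chunk has visited every device and is back where it started; real implementations overlap this communication with the per-chunk attention computation.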

new feature

I'm trying to use Llama 3.1 70B to do a "multi-needle in a haystack" search. Basically, I'm asking the model to take a text and search it for a list of terms;...

Behavior 2.5