joshpopelka20
Is it possible to run this in the cloud?
I'm using this code to run inference: ``` # Use a pipeline as a high-level helper from transformers import pipeline # Load model directly from transformers import AutoTokenizer, AutoModelForTokenClassification tokenizer...
I'm working with a long-context model (gradientai/Llama-3-8B-Instruct-262k) that exceeds the memory of a single A100 GPU. The model weights load, but when I try to run inference, I...
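For a model that doesn't fit on one GPU, the usual first step is loading with `device_map="auto"` (e.g. `AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)`), which lets Accelerate shard layers across the available devices. Conceptually, that placement assigns consecutive layers to devices greedily by memory budget. Here's a toy sketch of that idea; the function and layer names are hypothetical, not the Accelerate API:

```python
def build_device_map(layer_sizes_gb, device_capacity_gb):
    """Toy greedy layer-to-device assignment, mimicking the idea
    behind device_map="auto": fill one device, then move to the next."""
    device_map, dev, used = {}, 0, 0.0
    for name, size in layer_sizes_gb:
        # spill to the next device once the current one is full
        if used + size > device_capacity_gb and used > 0:
            dev += 1
            used = 0.0
        device_map[name] = dev
        used += size
    return device_map


# hypothetical per-layer sizes for illustration only
layers = [("layer.0", 3.0), ("layer.1", 3.0), ("layer.2", 3.0)]
print(build_device_map(layers, device_capacity_gb=7.0))
```

Note that `device_map="auto"` only shards the weights; activations (and especially the KV cache at 262k context) still have to fit, which is where it usually fails for long-context inference.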
This is the start of the RingAttention code. The changes so far create multiple KV caches (one per device when num_devices > 1) and attempt to split the sequence into separate chunks.
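The core idea in RingAttention is that each device holds one KV chunk, and attention is accumulated chunk by chunk with an online (running) softmax, so no device ever materializes the full attention matrix. A minimal single-process NumPy sketch of that accumulation (function names are mine; the real implementation overlaps the ring communication of KV chunks with compute):

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: ordinary softmax attention over the full KV."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ v

def ring_attention(q, k, v, num_chunks):
    """Accumulate attention over KV chunks with an online softmax,
    as each 'device' in the ring would do with its local chunk."""
    d = q.shape[-1]
    m = np.full(q.shape[0], -np.inf)  # running row-wise max of scores
    l = np.zeros(q.shape[0])          # running softmax denominator
    out = np.zeros_like(q)            # running weighted sum of V
    for kc, vc in zip(np.array_split(k, num_chunks),
                      np.array_split(v, num_chunks)):
        s = q @ kc.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])
        scale = np.exp(m - m_new)     # rescale old accumulators
        l = l * scale + p.sum(axis=-1)
        out = out * scale[:, None] + p @ vc
        m = m_new
    return out / l[:, None]
```

The chunked result matches full attention exactly (up to float rounding), which is what makes it safe to shard the 262k-token KV cache across devices.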
I'll work through adding it to quantized Llama first, since that's the architecture I know best. Link to the paper: https://arxiv.org/abs/2310.01889
I'm trying to use Llama 3.1 70B to do a "multi-needle in a haystack" search. Basically, I'm asking the model to take a text and search it for a list of terms;...
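A multi-needle test like this usually has two parts: planting the needles at random positions in the filler text, and scoring how many the model's answer recovers. A small harness sketch (helper names and the needle sentence template are my own, not from any benchmark library):

```python
import random

def build_haystack(filler_paragraphs, needles, seed=0):
    """Insert one needle sentence per term at a random position
    among the filler paragraphs."""
    rng = random.Random(seed)
    paras = list(filler_paragraphs)
    for term in needles:
        pos = rng.randrange(len(paras) + 1)
        paras.insert(pos, f"The secret term is: {term}.")
    return "\n\n".join(paras)

def score_answer(answer, needles):
    """Fraction of planted terms the model's answer mentions."""
    found = [t for t in needles if t.lower() in answer.lower()]
    return len(found) / len(needles)
```

Scoring by substring match is crude but cheap; running the same needles at several insertion depths and context lengths is what reveals where long-context retrieval starts to degrade.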