
Codellama 16K context length?

Open ShahZ181 opened this issue 2 years ago • 3 comments

Has anyone gotten 16k context length with CodeLlama or Llama 2? I have tried multiple models, but they all start producing gibberish once the context window gets past 4096. I am using exllama and I changed all the settings that seemed necessary to get it to work, but it still doesn't work.

I am running python3 data_new.py -d /home/shahrukh/Documents/vicuana-7/models/TheBloke_Airoboros-c34B-2.1-GPTQ -gs 13,13,13,0 --compress_pos_emb 2 -alpha 4 -l 8000

However, with either CodeLlama or Llama 2 the context window cannot be increased past 4096. Does anyone know why that might be?

ShahZ181 · Aug 28 '23 08:08

IME Airoboros doesn't use compress_pos_emb, and I found the best perplexity was obtained with alpha 2.7. The default 100k base gave worse results when I ran it as a LoRA.

8k context has worked for me with those settings.
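
For reference, a rough sketch of those settings via ExLlamaConfig (paths are placeholders; the calculate_rotary_embedding_base() call follows the pattern exllama's example loaders use to apply alpha_value):

from model import ExLlama, ExLlamaConfig   # exllama's model.py

config = ExLlamaConfig("<model_dir>/config.json")     # placeholder path
config.model_path = "<model_dir>/model.safetensors"   # placeholder path
config.max_seq_len = 8192                  # 8k context
config.compress_pos_emb = 1.0              # leave linear RoPE scaling at the default
config.alpha_value = 2.7                   # NTK alpha
config.calculate_rotary_embedding_base()   # recompute the RoPE base from alpha_value

model = ExLlama(config)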

Ph0rk0z · Aug 29 '23 13:08

What params do I need to change to support longer context? I'd appreciate it if someone could point me to where in the docs I can find this.

I'm getting:

RuntimeError: start (2048) + length (1265) exceeds dimension size (2048).

when I run:

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import torch

# Create config, model, tokenizer and generator
config = ExLlamaConfig(model_config_path)  # create config from config.json
config.model_path = model_path  # supply path to model weights file

model = ExLlama(config)  # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)  # create tokenizer from tokenizer model file

cache = ExLlamaCache(model)  # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)  # create generator

# Configure generator
generator.disallow_tokens([tokenizer.eos_token_id])
generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.05
generator.settings.top_p = 0.65
generator.settings.top_k = 100
generator.settings.typical = 0.5

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

system_prompt="You are a helpful assistant. You are an expert on summarisation."
user_prompt="Provide a three bullet summary of the above content."

prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{escaped_transcript}\n\n{user_prompt.strip()} {E_INST}\n\n"  # escaped_transcript is the long input text, defined earlier

# # Modify the prompt to ask for a summary of escaped_transcript
# prompt = f"Please summarize the following text:\n\n{escaped_transcript}\n\n"

print(prompt, end="")

# Produce a simple generation
output = generator.generate_simple(prompt, max_new_tokens=200)

# Print the generated summary, omitting the prompt
print(output[len(prompt):])
torch.cuda.empty_cache()

RonanKMcGovern · Sep 08 '23 10:09

There are a few parameters in the config (ExLlamaConfig) related to context length (a short sketch of setting them follows this list):

  • max_seq_len is the main one. It should just be 16384 for a 16k model, if you have the VRAM to hold a cache of that size.
  • max_input_len and max_attention_size are used to limit the number of tokens forwarded through the model at once. The input to model.forward is transparently chunked, so this is just a tradeoff between VRAM and speed. If you have VRAM to spare and you want more tokens per second for prompt processing, you could consider increasing them.
  • compress_pos_emb is the RoPE scaling factor. You would only change this for a model that's finetuned to use a particular value.
  • alpha_value sets the "NTK" RoPE base, related to what Meta now calls "theta" for CodeLlama. ExLlama should automatically read the correct theta value from the config file, so you shouldn't need to change this.
  • use_flash_attn_2 may help performance on very long prompts. It hasn't performed especially well in my tests, but if you're doing things like summaries of 16k inputs, it could be worth trying.
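
Putting that together, a minimal sketch of what the config changes look like in code (paths are placeholders and the chunking values are just examples):

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

config = ExLlamaConfig("<model_dir>/config.json")        # placeholder path
config.model_path = "<model_dir>/model.safetensors"      # placeholder path

config.max_seq_len = 16384        # 16k cache; the default is 2048
config.max_input_len = 4096       # optional: forward larger chunks for faster prompt processing
config.max_attention_size = 4096 ** 2
# compress_pos_emb and alpha_value are left at their defaults here; for CodeLlama the
# RoPE theta should be picked up from config.json automatically.

model = ExLlama(config)
tokenizer = ExLlamaTokenizer("<model_dir>/tokenizer.model")  # placeholder path
cache = ExLlamaCache(model)       # cache is allocated for config.max_seq_len tokens
generator = ExLlamaGenerator(model, tokenizer, cache)

The default max_seq_len is 2048, which is presumably why the snippet above stops with the "start (2048) + length (1265) exceeds dimension size (2048)" error.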

turboderp · Sep 08 '23 14:09