cold-compress
Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of GPT-Fast, a simple, PyTorch-native generation codebase.
Does your code have a function to analyze attention scores, or do they need to be observed in the **Transformer** class?
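One way to observe attention scores without modifying the **Transformer** class is a PyTorch forward hook. The sketch below is illustrative only (it uses a plain `nn.MultiheadAttention`, not cold-compress's modules); the hook and variable names are assumptions:

```python
import torch
import torch.nn as nn

# Captured attention weights land here, keyed for later inspection.
captured = {}

def save_attention(module, args, kwargs, output):
    # nn.MultiheadAttention returns (attn_output, attn_weights)
    # when need_weights=True; stash the weights for analysis.
    captured["weights"] = output[1].detach()

attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
handle = attn.register_forward_hook(save_attention, with_kwargs=True)

x = torch.randn(1, 5, 16)  # (batch, seq_len, embed_dim)
attn(x, x, x, need_weights=True)
handle.remove()

print(captured["weights"].shape)  # averaged over heads: (batch, L, S)
```

The same pattern would apply to any attention module that returns its weights; if the module only exposes fused attention (e.g. SDPA), you would instead need to recompute or expose the scores inside the forward pass.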
Thank you for providing the code to easily test various KV-related algorithms. I have a question regarding evaluation. I compared evaluations on TruthfulQA; accuracy was recorded in "truthfulqa_metrics.json". When the...
SnapKV
Hello, I see SnapKV is used for the Heavy Hitter Prompt Compression strategy. As far as I understand (correct me if I'm wrong), it is also used in the benchmarks...
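For context, the core of a SnapKV-style heavy-hitter selection can be sketched as follows. This is a simplified illustration under my own assumptions (pooling over heads, no kernel-size smoothing of the votes), not the cold-compress implementation:

```python
import torch

def snapkv_select(attn: torch.Tensor, window: int, budget: int) -> torch.Tensor:
    """Score each prefix position by the attention it receives from the
    last `window` query positions (the observation window), then keep the
    top-`budget` prefix positions as heavy hitters.

    attn: (num_heads, q_len, k_len) attention weights for the prompt.
    Returns sorted indices of the kept prefix keys.
    """
    prefix_len = attn.shape[-1] - window
    # Votes: attention mass from the observation window, pooled over heads.
    votes = attn[:, -window:, :prefix_len].sum(dim=(0, 1))  # (prefix_len,)
    keep = torch.topk(votes, k=min(budget, prefix_len)).indices
    return torch.sort(keep).values

# Toy usage: 2 heads, 10 prompt tokens, 4-token observation window.
scores = torch.softmax(torch.randn(2, 10, 10), dim=-1)
kept = snapkv_select(scores, window=4, budget=3)
```

The kept indices (plus the observation window itself) would then index into the KV cache before generation begins; the actual SnapKV paper additionally max-pools the votes to retain clusters of neighboring tokens.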
I'm getting the error below when running any torch code. This is probably due to an incompatible CUDA version (requirements.txt specifies cu121). I would suggest to either add a...
Other modifications worth mentioning: - Changed `scripts/convert_hf_checkpoint.py` to support loading of finetuned Llama-3 models from safetensors state dict - Added finetuned configs to `model.py` (Finetuned models use a vocab size...
https://arxiv.org/abs/2407.21018
https://arxiv.org/abs/2405.12532
[InfLLM](https://arxiv.org/abs/2402.04617)
Implement this [paper](https://arxiv.org/abs/2407.02490). Similar to `class KVCacheFastGen` in that it involves a profiling step.
untested