Experiment - Running on AWS EC2 Graviton c6g.xlarge
Hi folks! First of all, congrats on this fantastic research result!
I managed to run Llama3-8B-1.58-100B-tokens on a relatively small Amazon EC2 instance.
- Instance family: c6g (Graviton2, ARM)
- Instance size: xlarge
- Architecture: arm64
- Specs: 4 vCPUs, 8 GiB RAM
- OS: Ubuntu
- Cost: 0.1360 USD per hour in us-east-1
https://github.com/user-attachments/assets/db244081-4a83-4923-8cba-62768723c975
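For anyone who wants to reproduce this, launching the instance from the AWS CLI looks roughly like the sketch below. The AMI ID, key pair name, and security group are placeholders you'd substitute with your own, and it assumes a default VPC:

```bash
# Launch a c6g.xlarge (4 vCPUs, 8 GiB RAM) in us-east-1.
# ami-xxxxxxxx, my-key, and sg-xxxxxxxx are placeholders -- replace them with
# your own Ubuntu arm64 AMI, key pair, and security group.
aws ec2 run-instances \
  --region us-east-1 \
  --image-id ami-xxxxxxxx \
  --instance-type c6g.xlarge \
  --count 1 \
  --key-name my-key \
  --security-group-ids sg-xxxxxxxx \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=bitnet-c6g}]'
```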
Build
I built the model on a larger instance (r7g.4xlarge), created an AMI from it, and then ran it on the smaller machine. The build took 21 minutes.
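For reference, the build steps were roughly the repo's standard setup. I'm quoting this from memory, so double-check the flags against the README:

```bash
# Clone the repo with its submodules and install the Python dependencies.
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
pip install -r requirements.txt

# Download the model and build it in the i2_s quantization format; this is
# the step that took ~21 minutes on the r7g.4xlarge.
python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s
```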
Run
```bash
python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "Daniel went back to the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:" -n 6 -temp 0 -t 4
```
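For context, the flags follow the usual llama.cpp conventions as far as I can tell: `-n` is the number of tokens to generate, `-temp` is the sampling temperature (0 for greedy decoding), and `-t` pins the thread count to the 4 vCPUs. A quick way to get a rough tokens-per-second figure is to time a slightly longer generation:

```bash
# Rough throughput check: greedy-decode 32 tokens on all 4 vCPUs and time it.
# (run_inference.py should also print llama.cpp's own timing stats at the end.)
time python run_inference.py \
  -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf \
  -p "Where is Mary?\nAnswer:" -n 32 -temp 0 -t 4
```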
Notes
It took about 10 minutes to load the model the first time, but after that I could run inference at will. Is this expected?
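My guess (not verified) is that the slow first load is EBS lazy loading: a volume created from an AMI snapshot pulls its blocks from S3 on first read, and subsequent runs are served from the warmed volume and the OS page cache. If that's right, pre-reading the weights once should hide the cost:

```bash
# Pre-warm: read the GGUF once so its blocks are pulled from the snapshot
# and cached; later inference runs then start almost immediately.
cat models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf > /dev/null
```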
I had to use an experimental fork of this repo (#79), as I was hitting the issue described in #74.
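In case it helps anyone else hitting the same problem: if that fork is open as a PR against this repo, the GitHub CLI can check it out directly (assuming you have `gh` installed and the repo cloned):

```bash
# Check out the branch behind PR #79 into the local clone.
gh pr checkout 79
```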
All the best!