Experiment - Running on AWS EC2 Graviton c6g.xlarge
Hi folks! First of all, congrats on this fantastic research result!
I managed to run Llama3-8B-1.58-100B-tokens on a relatively small Amazon EC2 instance.
- Instance family: c6g (Graviton2, ARM)
- Instance size: xlarge
- Architecture: arm64
- Specs: 4 vCPUs, 8 GiB RAM
- OS: Ubuntu
- Cost: 0.1360 USD per hour in us-east-1
https://github.com/user-attachments/assets/db244081-4a83-4923-8cba-62768723c975
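For anyone who wants to reproduce this, launching the instance from the AWS CLI looks roughly like the sketch below. The AMI ID, key pair name, and security group are placeholders you'd substitute with your own, and it assumes a default VPC:

```bash
# Launch a c6g.xlarge (4 vCPUs, 8 GiB RAM) in us-east-1.
# ami-xxxxxxxx, my-key, and sg-xxxxxxxx are placeholders -- replace them with
# your own Ubuntu arm64 AMI, key pair, and security group.
aws ec2 run-instances \
  --region us-east-1 \
  --image-id ami-xxxxxxxx \
  --instance-type c6g.xlarge \
  --count 1 \
  --key-name my-key \
  --security-group-ids sg-xxxxxxxx \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=bitnet-c6g}]'
```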
Build
I built the model on a larger instance (r7g.4xlarge), created an AMI from it, and then ran it on the smaller machine. The build took 21 minutes.
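For reference, the build steps were roughly the repo's standard setup. I'm quoting this from memory, so double-check the flags against the README:

```bash
# Clone the repo with its submodules and install the Python dependencies.
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
pip install -r requirements.txt

# Download the model and build it in the i2_s quantization format; this is
# the step that took ~21 minutes on the r7g.4xlarge.
python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s
```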
Run
```bash
python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "Daniel went back to the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:" -n 6 -temp 0 -t 4
```
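For context, the flags follow the usual llama.cpp conventions as far as I can tell: `-n` is the number of tokens to generate, `-temp` is the sampling temperature (0 for greedy decoding), and `-t` pins the thread count to the 4 vCPUs. A quick way to get a rough tokens-per-second figure is to time a slightly longer generation:

```bash
# Rough throughput check: greedy-decode 32 tokens on all 4 vCPUs and time it.
# (run_inference.py should also print llama.cpp's own timing stats at the end.)
time python run_inference.py \
  -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf \
  -p "Where is Mary?\nAnswer:" -n 32 -temp 0 -t 4
```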
Notes
It took about 10 minutes to load the model the first time, but after that I could run inference at will. Is this expected?
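My guess (not verified) is that the slow first load is EBS lazy loading: a volume created from an AMI snapshot pulls its blocks from S3 on first read, and subsequent runs are served from the warmed volume and the OS page cache. If that's right, pre-reading the weights once should hide the cost:

```bash
# Pre-warm: read the GGUF once so its blocks are pulled from the snapshot
# and cached; later inference runs then start almost immediately.
cat models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf > /dev/null
```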
I had to use an experimental fork of this repo (#79), as I was hitting the issue described in #74.
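In case it helps anyone else hitting the same problem: if that fork is open as a PR against this repo, the GitHub CLI can check it out directly (assuming you have `gh` installed and the repo cloned):

```bash
# Check out the branch behind PR #79 into the local clone.
gh pr checkout 79
```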
All the best!