Running BLOOM
What kind of machine is required just to run inference on the 176B model? https://huggingface.co/bigscience/bloom
Right now, 8×80GB A100 or 16×40GB A100 [GPUs]. With the "accelerate" library you have offloading though, so as long as you have enough RAM, or even just disk, for ~300GB you're good to go (but slower).
Source: https://www.infoq.com/news/2022/07/bigscience-bloom-nlp-ai/
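For the multi-GPU case, loading with `transformers` + `accelerate` looks roughly like this (a minimal sketch, not tested here; `device_map="auto"` is what triggers the automatic placement across visible GPUs):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" asks accelerate to shard the weights across all
# visible GPUs (e.g. 8x80GB or 16x40GB A100s) automatically.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    torch_dtype=torch.bfloat16,  # BLOOM's native dtype, ~2 bytes/param
)

inputs = tokenizer("BLOOM is", return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))
```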
According to this post, you can run it on consumer hardware at about 3 minutes/token.
According to this post, even on pretty good GPU hardware it can still take 90 seconds/token. It seems like you need a really high-end system to run it quickly.
For inference only, what are the minimum requirements for RAM and GPU memory?
About 350 GB of GPU RAM (~200 GB if you quantise to int8).
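That figure falls straight out of the parameter count; a quick back-of-the-envelope check:

```python
# 176B parameters, 2 bytes each in fp16/bf16, 1 byte each in int8.
params = 176_000_000_000
for dtype, bytes_per_param in [("fp16", 2), ("int8", 1)]:
    print(f"{dtype}: ~{params * bytes_per_param / 1e9:.0f} GB for weights alone")
# fp16: ~352 GB for weights alone
# int8: ~176 GB for weights alone
# Activations, KV cache, and framework overhead account for the rest.
```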
> About 350 GB of GPU RAM (~200 GB if you quantise to int8).

For inference only?
Yep, you need to get all those parameters into GPU RAM to run inference. Like I mentioned, you can use the accelerate framework to do "swapping" from CPU RAM to GPU RAM, which lets you do it with much less GPU RAM at a ridiculous speed penalty.
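Concretely, that swapping setup is just a per-device memory cap at load time; a sketch assuming recent transformers/accelerate versions (the cap values are illustrative, not recommendations):

```python
from transformers import AutoModelForCausalLM

# max_memory caps what each device may hold; weights over the GPU cap stay
# in CPU RAM (or spill to offload_folder on disk) and are swapped onto the
# GPU layer by layer during the forward pass, hence the speed penalty.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    torch_dtype="auto",
    max_memory={0: "20GiB", "cpu": "300GiB"},  # tune to your machine
    offload_folder="bloom-offload",
)
```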