GLM
4-bit quantization of the 10B model
Is there any way to run the 4-bit quantization process from the GLM-130B repo on the 10B GLM model?
https://github.com/THUDM/GLM-130B/blob/main/docs/quantization.md
Because the architectures are similar, I expect the idea should work. But currently the frameworks seem incompatible: the GLM code base doesn't use the SwissArmyTransformer framework, while the quantization scripts require it. Are there any small hacks I can make to fit them together?
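For reference, the core of GLM-130B-style weight-only quantization is simple enough that it can be prototyped outside SwissArmyTransformer. Below is a minimal, hedged sketch of per-row absmax quantization of a linear weight to the signed 4-bit range [-8, 7]; it is not the repo's actual script, and function names here (`quantize_4bit`, `dequantize_4bit`) are illustrative, not from either code base:

```python
import torch

def quantize_4bit(weight: torch.Tensor):
    """Per-row absmax quantization to signed 4-bit integers.

    Returns the quantized weights (stored in int8 for simplicity;
    a real kernel would pack two values per byte) and per-row scales.
    """
    # One scale per output row, mapping the row's max magnitude to 7.
    scale = weight.abs().max(dim=1, keepdim=True).values / 7.0
    scale = scale.clamp(min=1e-8)  # guard against all-zero rows
    q = torch.clamp(torch.round(weight / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_4bit(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Dequantize back to float for the matmul (weight-only scheme:
    # activations stay in full precision).
    return q.to(torch.float32) * scale

w = torch.randn(4, 16)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
max_err = (w - w_hat).abs().max().item()
```

Rounding error per element is bounded by half a quantization step (`scale / 2`), which is why weight-only 4-bit schemes can stay usable for large models.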
If one could fit within 8 GB of VRAM, that would make the GLM 10B models a lot more accessible for end-user experimentation. Thanks!