AMMO Integration with Llama2 Post-Training Quantization Example and Tests
What does this PR do?
This PR integrates the AMMO library into the project and provides utilities for quantizing models, with a Llama2 PTQ example.
Several quantization algorithms are available, including INT8 SmoothQuant, INT4 AWQ, and FP8.
The main class `Quantizer` from the `nemo.export.quantize` submodule produces a `.qnemo` tarball to be consumed by the TensorRT-LLM toolbox for efficient inference. This will be part of the NeMo Framework Inference Container.
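As background on the INT8 path, post-training quantization typically maps floating-point values to 8-bit integers using a scale derived from the tensor's maximum magnitude. The sketch below illustrates plain per-tensor symmetric INT8 quantization; it is illustrative only and is not AMMO's actual implementation (SmoothQuant additionally migrates activation outliers into the weights before scaling):

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: scale = max|x| / 127."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax else 1.0
    # Round to nearest integer and clamp to the int8 range.
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floating-point values from INT8 codes."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)
```

The round trip reconstructs each value to within one quantization step (`scale`), which is why calibrating the scale on representative data matters for accuracy.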
Collection: [NLP]
Changelog
- Adding nvidia-ammo package to requirements
- Adding `nemo.export.quantize` submodule for quantizing models
- Adding `tests.setup` module to facilitate Jenkins setup
- Adding PTQ test to Jenkins
Usage
Example for INT8 SmoothQuant method:
```sh
python examples/nlp/language_modeling/megatron_llama_quantization.py \
    model_file=llama2-7b-fp16.nemo \
    decoder_type=llama \
    quantization.algorithm=int8_sq \
    inference_tensor_parallel=1 \
    model_save_path=llama2-7b-fp16.qnemo
```
Jenkins CI
To run Jenkins, a NeMo User with write access must comment `jenkins` on the PR.
Before your PR is "Ready for review"
Pre checks:
- [x] Make sure you read and followed Contributor guidelines
- [x] Did you write any new necessary tests?
- [x] Did you add or update any necessary documentation?
- [ ] Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [x] New Feature
- [ ] Bugfix
- [ ] Documentation
If you haven't finished some of the above items you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs to various areas.
Additional Information
For a more transparent and easier review process, some components were isolated into individual PRs:
- https://github.com/NVIDIA/NeMo/pull/8281
- https://github.com/NVIDIA/NeMo/pull/8429