Document / support for using BFLOAT16 with (Xeon) TGI service
The model used for ChatQnA supports BFLOAT16, in addition to TGI's default 32-bit float type: https://huggingface.co/Intel/neural-chat-7b-v3-3
TGI memory usage halves from 30 GB to 15 GB (and its performance also improves somewhat) if it is told to use BFLOAT16:
--- a/ChatQnA/kubernetes/manifests/tgi_service.yaml
+++ b/ChatQnA/kubernetes/manifests/tgi_service.yaml
@@ -28,6 +29,8 @@ spec:
args:
- --model-id
- $(LLM_MODEL_ID)
+ - --dtype
+ - bfloat16
#- "/data/Llama-2-7b-hf"
# - "/data/Mistral-7B-Instruct-v0.2"
# - --quantize
However, only newer Xeons support BFLOAT16. Therefore, if the user's cluster has heterogeneous nodes, the TGI service needs a node selector that schedules it on a node with BFLOAT16 support.
This can be automated by using node-feature-discovery and its CPU feature labeling: https://kubernetes-sigs.github.io/node-feature-discovery/stable/usage/features.html#cpu
It would be good to add some documentation and examples (e.g. comment lines in YAML) for this.
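For example, once node-feature-discovery is running, the TGI Deployment could be pinned to BF16-capable nodes with a nodeSelector on the NFD CPUID label. A minimal sketch, assuming NFD's default cpu-cpuid feature labels are enabled:

```yaml
# Sketch: fragment of the TGI Deployment spec in tgi_service.yaml,
# pinned to nodes where node-feature-discovery has labeled the CPU
# as advertising AVX512_BF16.
spec:
  template:
    spec:
      nodeSelector:
        feature.node.kubernetes.io/cpu-cpuid.AVX512BF16: "true"
      containers:
        - name: tgi
          args:
            - --model-id
            - $(LLM_MODEL_ID)
            - --dtype
            - bfloat16
```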
Wikipedia has a nifty table listing the platforms currently supporting AVX-512 with BF16 support: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX-512
i.e. Intel Cooper Lake & Sapphire Rapids, and AMD Zen 4 & 5.
On platforms that do not support BF16 (e.g. Ice Lake), TGI seems to still work when the BF16 type is specified, just slightly slower (due to a conversion step?).
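Given that, a softer alternative to a hard nodeSelector would be a preferred node affinity on the same NFD label, so the scheduler favors BF16-capable nodes but can still fall back to older ones. A sketch of the relevant pod-spec fragment:

```yaml
# Sketch: goes under the Deployment's pod spec (spec.template.spec).
# Prefer BF16-capable nodes, but allow scheduling elsewhere if none
# are available (per the observation above, TGI still runs, just slower).
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: feature.node.kubernetes.io/cpu-cpuid.AVX512BF16
              operator: In
              values:
                - "true"
```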
We can add info in the docs to remind users to disable BF16 on machines that do not support it.
Hi @eero-t, the node-feature-discovery plugin can help select nodes (CPUs) by labeling nodes with their CPU features, but it needs to create an additional pod.
We pushed a PR that provides the recipe for labeling nodes and setting up TGI with BFLOAT16, see https://github.com/opea-project/GenAIExamples/pull/795
Example manifests are generated from the Infra project's Helm charts. Shouldn't there rather be Helm support for enabling it? See:
- https://github.com/opea-project/GenAIInfra/pull/402/files#diff-9ff19985b33716e4f25db59c4f8a9c5611b54034dda7108860e9131b30444b8b
- https://github.com/opea-project/GenAIInfra/pull/386#discussion_r1753682590
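For example, if the tgi chart exposes values for extra launcher arguments and a node selector, enabling this would just be a values override. A hypothetical sketch (extraCmdArgs and the label key are assumptions about what the chart and NFD provide):

```yaml
# Hypothetical values override for the tgi Helm chart; assumes the chart
# exposes extraCmdArgs and nodeSelector values (names are illustrative).
tgi:
  extraCmdArgs: ["--dtype", "bfloat16"]
  nodeSelector:
    feature.node.kubernetes.io/cpu-cpuid.AVX512BF16: "true"
```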
That's in our plan.
We will add BF16 instructions to the Docker README.
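For the Docker deployments, passing the flag would look roughly like this in a compose file. This is a sketch only; the service name, image tag, model and ports are illustrative placeholders, not the exact ChatQnA compose file:

```yaml
# Sketch: enabling BFLOAT16 for TGI in a docker compose service.
# Service name, image tag, model and ports are illustrative placeholders.
services:
  tgi-service:
    image: ghcr.io/huggingface/text-generation-inference:latest
    command: --model-id Intel/neural-chat-7b-v3-3 --dtype bfloat16
    ports:
      - "8080:80"
    volumes:
      - ./data:/data
```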