Document / support for using BFLOAT16 with (Xeon) TGI service
The model used for ChatQnA supports BFLOAT16, in addition to TGI's default 32-bit float type: https://huggingface.co/Intel/neural-chat-7b-v3-3
TGI memory usage halves from 30 GB to 15 GB (and its performance also improves somewhat) if it is told to use BFLOAT16:
--- a/ChatQnA/kubernetes/manifests/tgi_service.yaml
+++ b/ChatQnA/kubernetes/manifests/tgi_service.yaml
@@ -28,6 +29,8 @@ spec:
args:
- --model-id
- $(LLM_MODEL_ID)
+ - --dtype
+ - bfloat16
#- "/data/Llama-2-7b-hf"
# - "/data/Mistral-7B-Instruct-v0.2"
# - --quantize
However, only newer Xeons support BFLOAT16. Therefore, if the user's cluster has heterogeneous nodes, the TGI service needs a node selector that schedules it on a node with BFLOAT16 support.
This can be automated by using node-feature-discovery and its CPU feature labeling: https://kubernetes-sigs.github.io/node-feature-discovery/stable/usage/features.html#cpu
It would be good to add some documentation and examples (e.g. comment lines in YAML) for this.
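For example, once node-feature-discovery is running, the TGI Deployment could be pinned to BF16-capable nodes with a nodeSelector on the NFD CPUID label. A minimal sketch, assuming NFD's default cpu-cpuid feature labels are enabled:

```yaml
# Sketch: fragment of the TGI Deployment spec in tgi_service.yaml,
# pinned to nodes where node-feature-discovery has labeled the CPU
# as advertising AVX512_BF16.
spec:
  template:
    spec:
      nodeSelector:
        feature.node.kubernetes.io/cpu-cpuid.AVX512BF16: "true"
      containers:
        - name: tgi
          args:
            - --model-id
            - $(LLM_MODEL_ID)
            - --dtype
            - bfloat16
```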
Wikipedia has a nifty table listing the platforms currently supporting AVX-512 with BF16 support: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX-512
i.e. Intel Cooper Lake & Sapphire Rapids, and AMD Zen 4 & 5.
On platforms that do not support BF16 (e.g. Ice Lake), TGI seems to still work when the BF16 type is specified, just slightly slower (due to a conversion step?).
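Given that, a softer alternative to a hard nodeSelector would be a preferred node affinity on the same NFD label, so the scheduler favors BF16-capable nodes but can still fall back to older ones. A sketch of the relevant pod-spec fragment:

```yaml
# Sketch: goes under the Deployment's pod spec (spec.template.spec).
# Prefer BF16-capable nodes, but allow scheduling elsewhere if none
# are available (per the observation above, TGI still runs, just slower).
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: feature.node.kubernetes.io/cpu-cpuid.AVX512BF16
              operator: In
              values:
                - "true"
```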
We can add info in the docs to remind users to disable BF16 on machines that do not support it.
Hi @eero-t, the node-feature-discovery plugin can help select nodes (CPUs) by labeling nodes with their CPU features, but it needs to create an additional pod.
We pushed a PR that provides the recipe for labeling nodes and setting up TGI with BFLOAT16, see https://github.com/opea-project/GenAIExamples/pull/795
Example manifests are generated from the Infra project's Helm charts. Shouldn't there rather be Helm support for enabling it? See:
- https://github.com/opea-project/GenAIInfra/pull/402/files#diff-9ff19985b33716e4f25db59c4f8a9c5611b54034dda7108860e9131b30444b8b
- https://github.com/opea-project/GenAIInfra/pull/386#discussion_r1753682590
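For example, if the tgi chart exposes values for extra launcher arguments and a node selector, enabling this would just be a values override. A hypothetical sketch (extraCmdArgs and the label key are assumptions about what the chart and NFD provide):

```yaml
# Hypothetical values override for the tgi Helm chart; assumes the chart
# exposes extraCmdArgs and nodeSelector values (names are illustrative).
tgi:
  extraCmdArgs: ["--dtype", "bfloat16"]
  nodeSelector:
    feature.node.kubernetes.io/cpu-cpuid.AVX512BF16: "true"
```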
That's in our plan.
We will add BF16 instructions to the Docker README.
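For the Docker deployments, passing the flag would look roughly like this in a compose file. This is a sketch only; the service name, image tag, model and ports are illustrative placeholders, not the exact ChatQnA compose file:

```yaml
# Sketch: enabling BFLOAT16 for TGI in a docker compose service.
# Service name, image tag, model and ports are illustrative placeholders.
services:
  tgi-service:
    image: ghcr.io/huggingface/text-generation-inference:latest
    command: --model-id Intel/neural-chat-7b-v3-3 --dtype bfloat16
    ports:
      - "8080:80"
    volumes:
      - ./data:/data
```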