Added support for ArcticForCausalLM.
Fixes #6877
Contains the following changes:
- increases the maximum number of experts from 60 to 128
- adds a new tensor type, FFN_NORM_EXP (for a normalization block preceding the MoE that runs in parallel to the attention + FFN; see #6877 for details)
- introduces architecture-specific block mappings in gguf-py (details in #6877)
- adds a new model type, MODEL_10B_128x3_66B
- adds the new ARCTIC architecture and general support for models based on it
Model files for testing: https://huggingface.co/sszymczyk/snowflake-arctic-instruct-GGUF
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 555 iterations 🚀
Expand details for performance-related PRs only
- Concurrent users: 8, duration: 10m
- HTTP request: avg=8432.03ms p(95)=21124.36ms fails=, finish reason: stop=502 truncated=53
- Prompt processing (pp): avg=94.97tk/s p(95)=428.46tk/s
- Token generation (tg): avg=47.12tk/s p(95)=47.24tk/s
- ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=snowflake-arctic-clean commit=602c80d918e609f8bd5120fcd346242ed2da5f74
[chart: llamacpp:prompt_tokens_seconds — prompt processing rate over the 10m run]
[chart: llamacpp:predicted_tokens_seconds — token generation rate over the 10m run]
[chart: llamacpp:kv_cache_usage_ratio — KV cache usage over the 10m run]
[chart: llamacpp:requests_processing — concurrent requests in flight over the 10m run]
It's possible to offload only the dense part of the model onto the GPU.
I noticed that the Arctic model doesn't use bias tensors, so I removed the usage of bias tensors in the LLM_ARCH_ARCTIC-related code (they were all null anyway).
I haven't tested it either, but it looks good, so feel free to merge.
@ggerganov I noticed that Snowflake changed the Arctic model 2 weeks ago. The commit says "Fixes for GQA support", and num_key_value_heads in config.json changed from 56 to 8, so I have to re-download the model and check if it still works.
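The drop from 56 to 8 KV heads means the model now uses grouped-query attention: each key/value head is shared by a fixed-size group of query heads. A quick sketch of the arithmetic, assuming num_attention_heads is 56 (which the pre-change num_key_value_heads value suggests):

```python
# GQA head-grouping arithmetic for the config.json change above.
# num_attention_heads = 56 is an assumption inferred from the old
# num_key_value_heads value (plain MHA has the two counts equal).
num_attention_heads = 56
num_key_value_heads = 8

# Query heads must divide evenly into KV-head groups.
assert num_attention_heads % num_key_value_heads == 0
group_size = num_attention_heads // num_key_value_heads  # query heads per KV head

# With head_dim unchanged, the KV cache shrinks proportionally:
# it stores num_key_value_heads instead of num_attention_heads heads.
kv_cache_ratio = num_key_value_heads / num_attention_heads

print(group_size)      # 7
print(kv_cache_ratio)  # 1/7 of the MHA cache size
```

This is also why the converter has to be re-checked: GQA changes the shapes of the K and V projection tensors relative to the Q projection.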