ggml-cpu: Support s390x SIMD Instruction Set
This pull request integrates the s390x SIMD instruction set into llama.cpp via vecintrin.h.
The SIMD instructions are currently used in the following ggml_vec_dot functions:
| Function | Implementation | Remarks |
|---|---|---|
| ggml_vec_dot_f32 | IMPLEMENTED | Noticed a hotspot at the assembly vector load. Will be fixed in a separate PR. |
| ggml_vec_dot_f16 | IMPLEMENTED | Noticed a hotspot at the assembly vector load. Will be fixed in a separate PR. |
| ggml_vec_dot_q4_0_q8_0 | IMPLEMENTED | |
| ggml_vec_dot_q4_1_q8_1 | IMPLEMENTED | |
| ggml_vec_dot_q8_0_q8_0 | IMPLEMENTED | |
| ggml_vec_dot_q4_K_q8_K | IMPLEMENTED | |
| ggml_vec_dot_q5_K_q8_K | IMPLEMENTED | |
| ggml_vec_dot_q6_K_q8_K | IMPLEMENTED | |
| ggml_vec_dot_iq4_nl_q8_0 | IMPLEMENTED | |
| ggml_vec_dot_iq4_xs_q8_K | IMPLEMENTED | |
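For readers unfamiliar with vecintrin.h, the following is a minimal sketch of how an f32 dot product could use it. It is not the code from this PR: `dot_f32_sketch` and `DOT_USE_ZVECTOR` are hypothetical names, and the guard assumes the compiler defines `__s390x__` and `__VEC__` when built with `-mzvector` for a target that supports float vectors (z14 or newer). On any other platform only the scalar loop is compiled.

```c
#include <stddef.h>

/* Assumption: __VEC__ is defined when building for s390x with -mzvector. */
#if defined(__s390x__) && defined(__VEC__)
#include <vecintrin.h>
#define DOT_USE_ZVECTOR 1
#else
#define DOT_USE_ZVECTOR 0
#endif

/* Hypothetical helper (not the ggml function): dot product of two f32 arrays. */
float dot_f32_sketch(const float *x, const float *y, size_t n) {
    float sum = 0.0f;
    size_t i = 0;
#if DOT_USE_ZVECTOR
    /* Process four floats per iteration with a fused multiply-add. */
    __vector float acc = vec_splats(0.0f);
    for (; i + 4 <= n; i += 4) {
        __vector float vx = vec_xl(0, x + i);
        __vector float vy = vec_xl(0, y + i);
        acc = vec_madd(vx, vy, acc); /* acc += vx * vy, lane by lane */
    }
    /* Horizontal reduction of the four lanes. */
    sum = acc[0] + acc[1] + acc[2] + acc[3];
#endif
    /* Scalar tail; also the full loop when the vector path is disabled. */
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Calling `dot_f32_sketch(a, b, 6)` with `a = {1, 2, 3, 4, 5, 6}` and `b` all ones returns 21 on either path. The actual kernels may use more accumulators and different unrolling.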
Verification
To ensure that this implementation did not break anything, the SIMD instruction set has been tested on the following models:
- Tested IBM Granite 3.0 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
- Tested IBM Granite 3.1 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
- Please suggest additional models to test in this PR
Performance Results
The performance results below use IBM Granite 3.1, as it has a better neural network than 3.0.
Before SIMD Instruction Set
| model | size | parameters | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | pp512 | 16.66 ± 0.01 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | pp512 | 16.30 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | pp512 | 23.31 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | pp512 | 26.52 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | pp512 | 29.73 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | pp512 | 23.91 ± 0.05 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | pp512 | 16.73 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | pp512 | 12.62 ± 0.01 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | pp512 | 23.88 ± 0.04 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | pp512 | 21.59 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | tg128 | 8.20 ± 0.07 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | tg128 | 9.70 ± 0.01 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | tg128 | 14.48 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | tg128 | 15.95 ± 0.06 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | tg128 | 19.80 ± 0.04 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | tg128 | 14.89 ± 0.06 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | tg128 | 10.94 ± 0.04 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | tg128 | 8.53 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | tg128 | 14.38 ± 0.07 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | tg128 | 13.22 ± 0.02 |
After SIMD Instruction Set
| model | size | parameters | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | pp512 | 85.46 ± 0.09 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | pp512 | 35.39 ± 0.13 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | pp512 | 121.46 ± 0.81 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | pp512 | 123.79 ± 0.40 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | pp512 | 137.36 ± 0.52 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | pp512 | 118.88 ± 0.56 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | pp512 | 111.65 ± 0.38 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | pp512 | 101.94 ± 0.59 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | pp512 | 94.28 ± 0.18 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | pp512 | 99.43 ± 0.87 |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | tg128 | 14.27 ± 0.29 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | tg128 | 13.97 ± 0.11 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | tg128 | 69.33 ± 1.41 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | tg128 | 65.97 ± 1.71 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | tg128 | 57.82 ± 0.60 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | tg128 | 72.14 ± 0.70 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | tg128 | 70.34 ± 0.69 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | tg128 | 63.45 ± 0.68 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | tg128 | 60.09 ± 1.33 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | tg128 | 66.48 ± 1.29 |
> [!NOTE]
> Tests were conducted on an IBM z15 Mainframe with 8 IFLs (cores) and 64 GB Memory on an LPAR.
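To make the before/after comparison easier to read, the speedup per quantization can be derived by dividing the "after" throughput by the "before" throughput. The sketch below does that arithmetic with the mean t/s figures copied from the two tables above (the ± deviations are ignored). The names `bench_row`, `speedup`, and `print_speedups` are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* One row per quantization: mean tokens/second before and after SIMD. */
typedef struct {
    const char *quant;
    double before_tps;
    double after_tps;
} bench_row;

/* pp512 means copied from the tables above. */
static const bench_row pp512_rows[] = {
    {"F32",    16.66,  85.46}, {"F16",    16.30,  35.39},
    {"Q4_0",   23.31, 121.46}, {"Q4_1",   26.52, 123.79},
    {"Q8_0",   29.73, 137.36}, {"Q4_K",   23.91, 118.88},
    {"Q5_K",   16.73, 111.65}, {"Q6_K",   12.62, 101.94},
    {"IQ4_NL", 23.88,  94.28}, {"IQ4_XS", 21.59,  99.43},
};

/* tg128 means copied from the tables above. */
static const bench_row tg128_rows[] = {
    {"F32",     8.20, 14.27}, {"F16",     9.70, 13.97},
    {"Q4_0",   14.48, 69.33}, {"Q4_1",   15.95, 65.97},
    {"Q8_0",   19.80, 57.82}, {"Q4_K",   14.89, 72.14},
    {"Q5_K",   10.94, 70.34}, {"Q6_K",    8.53, 63.45},
    {"IQ4_NL", 14.38, 60.09}, {"IQ4_XS", 13.22, 66.48},
};

/* Speedup factor: after / before. */
double speedup(const bench_row *row) {
    return row->after_tps / row->before_tps;
}

/* Print one line per quantization for the given test name. */
void print_speedups(const char *test, const bench_row *rows, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        printf("%-6s %-7s %.2fx\n", test, rows[i].quant, speedup(&rows[i]));
    }
}
```

For example, `print_speedups("pp512", pp512_rows, sizeof pp512_rows / sizeof pp512_rows[0])` lists the prompt-processing speedups. By this arithmetic, F32 pp512 improves by roughly 5.1x, while F16 tg128 improves by roughly 1.4x.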
Please review this pull request and consider merging into the main repository. Thank you!
I have fixed all problems and re-tested the implementation to ensure that it is working as intended. No problems so far. Do let me know how I should proceed with this PR.
It appears that these failing unit tests point towards not being able to download a model from HuggingFace. In run number 2, the server unit test threw the following error, which shows that the test was unable to download a model for testing.
Run number 2 error details
==================================== ERRORS ====================================
________________ ERROR at setup of test_with_and_without_draft _________________
@pytest.fixture(scope="module", autouse=True)
def fixture_create_server():
> return create_server()
unit/test_speculative.py:21:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
unit/test_speculative.py:14: in create_server
server.model_draft = download_file(MODEL_DRAFT_FILE_URL)
utils.py:410: in download_file
wget.download(url, out=output_file)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/wget.py:526: in download
(tmpfile, headers) = ulib.urlretrieve(binurl, tmpfile, callback)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:241: in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:216: in urlopen
return opener.open(url, data, timeout)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:525: in open
response = meth(req, response)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:634: in http_response
response = self.parent.error(
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:563: in error
return self._call_chain(*args)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:496: in _call_chain
result = func(*args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <urllib.request.HTTPDefaultErrorHandler object at 0x7f861c9f0f50>
req = <urllib.request.Request object at 0x7f861ca31ed0>
fp = <http.client.HTTPResponse object at 0x7f861ca2ab90>, code = 504
msg = 'Gateway Time-out'
hdrs = <http.client.HTTPMessage object at 0x7f861ca32c10>
def http_error_default(self, req, fp, code, msg, hdrs):
> raise HTTPError(req.full_url, code, msg, hdrs, fp)
E urllib.error.HTTPError: HTTP Error 504: Gateway Time-out
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:643: HTTPError
---------------------------- Captured stdout setup -----------------------------
Downloading https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf to ./tmp/stories15M-q4_0.gguf
=========================== short test summary info ============================
ERROR unit/test_speculative.py::test_with_and_without_draft - urllib.error.HTTPError: HTTP Error 504: Gateway Time-out
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
====== 122 passed, 3 skipped, 108 deselected, 1 error in 94.14s (0:01:34) ======
Error: Process completed with exit code 1.
In run number 3, the server-windows unit test threw the following error, and it appears to be the same problem: the test was unable to download a model for testing.
Run number 3 error details
0 0 0 0 0 0 0 0 --:--:-- 0:00:10 --:--:-- 0
100 3035 100 3035 0 0 300 0 0:00:10 0:00:10 --:--:-- 744
0.10.309.401 E common_download_file: invalid http status code received: 504
0.10.314.217 E common_iniWaiting for server to start...
-------------------------- Captured stdout teardown ---------------------------
Stopping server with pid=6332
=========================== short test summary info ===========================
FAILED unit/test_basic.py::test_server_start_simple - TimeoutError: Server did not start within 12 seconds
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!
===================== 1 failed, 108 deselected in 12.99s ======================
Error: Process completed with exit code 1.
In both runs, the HuggingFace server returned a 504 status code. I believe this is unrelated to my code, unless I am missing something here.
Do let me know how this PR can proceed given these sporadic unit test errors.
> It appears that these failing unit tests point towards not being able to download a model from HuggingFace.
Yes, these runs fail from time to time for some reason - not related to this PR.