llama.cpp ggml-cpu: Support s390x SIMD Instruction Set

This pull request integrates the s390x SIMD instruction set into llama.cpp via vecintrin.h. SIMD is currently used in the following ggml_vec_dot functions:

Function	Implementation	Remarks
ggml_vec_dot_f32	IMPLEMENTED	Hotspot observed at the vector load assembly instruction. Will be fixed in a separate PR.
ggml_vec_dot_f16	IMPLEMENTED	Hotspot observed at the vector load assembly instruction. Will be fixed in a separate PR.
ggml_vec_dot_q4_0_q8_0	IMPLEMENTED
ggml_vec_dot_q4_1_q8_1	IMPLEMENTED
ggml_vec_dot_q8_0_q8_0	IMPLEMENTED
ggml_vec_dot_q4_K_q8_K	IMPLEMENTED
ggml_vec_dot_q5_K_q8_K	IMPLEMENTED
ggml_vec_dot_q6_K_q8_K	IMPLEMENTED
ggml_vec_dot_iq4_nl_q8_0	IMPLEMENTED
ggml_vec_dot_iq4_xs_q8_K	IMPLEMENTED
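
For readers unfamiliar with vecintrin.h, the following is a minimal sketch of what a vectorized F32 dot product can look like on s390x. It is not the code in this PR: `sketch_vec_dot_f32` and `SKETCH_HAVE_ZVECTOR` are hypothetical names, and the preprocessor guard (s390x, `-mzvector` enabled, architecture level 12 or later for `__vector float`) is an assumption about the build configuration. On any other target the function falls back to a plain scalar loop.

```c
#include <stddef.h>

/* Assumption: vector float intrinsics are usable when the compiler targets
 * s390x with -mzvector and an architecture level of z14 (12) or later. */
#if defined(__s390x__) && defined(__VEC__) && defined(__ARCH__) && (__ARCH__ >= 12)
#include <vecintrin.h>
#define SKETCH_HAVE_ZVECTOR 1
#endif

/* Hypothetical helper, not the PR's ggml_vec_dot_f32: computes the dot
 * product of x and y, four float lanes at a time where possible. */
static float sketch_vec_dot_f32(size_t n, const float *x, const float *y) {
    float sum = 0.0f;
    size_t i = 0;

#ifdef SKETCH_HAVE_ZVECTOR
    __vector float acc = vec_splats(0.0f);   /* four partial sums */
    for (; i + 4 <= n; i += 4) {
        __vector float vx = vec_xl(0, x + i); /* load x[i..i+3] */
        __vector float vy = vec_xl(0, y + i); /* load y[i..i+3] */
        acc = vec_madd(vx, vy, acc);          /* acc += vx * vy */
    }
    /* Horizontal reduction of the four lanes. */
    sum = acc[0] + acc[1] + acc[2] + acc[3];
#endif

    /* Scalar tail, or the whole array when no vector support is compiled in. */
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

The scalar tail keeps the function correct for lengths that are not a multiple of four, and the quantized variants listed above would need additional unpacking steps that this sketch does not attempt to show.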

Verification

To check that this implementation does not break anything, the SIMD code paths were tested with the following models:

Tested IBM Granite 3.0 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
Tested IBM Granite 3.1 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
Please suggest any additional models that should be tested in this PR

Performance Results

The performance results below use IBM Granite 3.1, the newer of the two models tested.

Before SIMD Instruction Set

model	size	parameters	backend	threads	test	t/s
Granite-3.1-1B-A400M-Instruct-BE-F32	4.97 GiB	1.33 B	BLAS	8	pp512	16.66 ± 0.01
Granite-3.1-1B-A400M-Instruct-BE-F16	2.49 GiB	1.33 B	BLAS	8	pp512	16.30 ± 0.02
Granite-3.1-1B-A400M-Instruct-BE-Q4_0	731.07 MiB	1.33 B	BLAS	8	pp512	23.31 ± 0.02
Granite-3.1-1B-A400M-Instruct-BE-Q4_1	807.57 MiB	1.33 B	BLAS	8	pp512	26.52 ± 0.03
Granite-3.1-1B-A400M-Instruct-BE-Q8_0	1.32 GiB	1.33 B	BLAS	8	pp512	29.73 ± 0.03
Granite-3.1-1B-A400M-Instruct-BE-Q4_K	782.12 MiB	1.33 B	BLAS	8	pp512	23.91 ± 0.05
Granite-3.1-1B-A400M-Instruct-BE-Q5_K	910.37 MiB	1.33 B	BLAS	8	pp512	16.73 ± 0.02
Granite-3.1-1B-A400M-Instruct-BE-Q6_K	1.02 GiB	1.33 B	BLAS	8	pp512	12.62 ± 0.01
Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL	737.07 MiB	1.33 B	BLAS	8	pp512	23.88 ± 0.04
Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS	700.32 MiB	1.33 B	BLAS	8	pp512	21.59 ± 0.03
Granite-3.1-1B-A400M-Instruct-BE-F32	4.97 GiB	1.33 B	BLAS	8	tg128	8.20 ± 0.07
Granite-3.1-1B-A400M-Instruct-BE-F16	2.49 GiB	1.33 B	BLAS	8	tg128	9.70 ± 0.01
Granite-3.1-1B-A400M-Instruct-BE-Q4_0	731.07 MiB	1.33 B	BLAS	8	tg128	14.48 ± 0.03
Granite-3.1-1B-A400M-Instruct-BE-Q4_1	807.57 MiB	1.33 B	BLAS	8	tg128	15.95 ± 0.06
Granite-3.1-1B-A400M-Instruct-BE-Q8_0	1.32 GiB	1.33 B	BLAS	8	tg128	19.80 ± 0.04
Granite-3.1-1B-A400M-Instruct-BE-Q4_K	782.12 MiB	1.33 B	BLAS	8	tg128	14.89 ± 0.06
Granite-3.1-1B-A400M-Instruct-BE-Q5_K	910.37 MiB	1.33 B	BLAS	8	tg128	10.94 ± 0.04
Granite-3.1-1B-A400M-Instruct-BE-Q6_K	1.02 GiB	1.33 B	BLAS	8	tg128	8.53 ± 0.02
Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL	737.07 MiB	1.33 B	BLAS	8	tg128	14.38 ± 0.07
Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS	700.32 MiB	1.33 B	BLAS	8	tg128	13.22 ± 0.02

After SIMD Instruction Set

model	size	parameters	backend	threads	test	t/s
Granite-3.1-1B-A400M-Instruct-BE-F32	4.97 GiB	1.33 B	BLAS	8	pp512	85.46 ± 0.09
Granite-3.1-1B-A400M-Instruct-BE-F16	2.49 GiB	1.33 B	BLAS	8	pp512	35.39 ± 0.13
Granite-3.1-1B-A400M-Instruct-BE-Q4_0	731.07 MiB	1.33 B	BLAS	8	pp512	121.46 ± 0.81
Granite-3.1-1B-A400M-Instruct-BE-Q4_1	807.57 MiB	1.33 B	BLAS	8	pp512	123.79 ± 0.40
Granite-3.1-1B-A400M-Instruct-BE-Q8_0	1.32 GiB	1.33 B	BLAS	8	pp512	137.36 ± 0.52
Granite-3.1-1B-A400M-Instruct-BE-Q4_K	782.12 MiB	1.33 B	BLAS	8	pp512	118.88 ± 0.56
Granite-3.1-1B-A400M-Instruct-BE-Q5_K	910.37 MiB	1.33 B	BLAS	8	pp512	111.65 ± 0.38
Granite-3.1-1B-A400M-Instruct-BE-Q6_K	1.02 GiB	1.33 B	BLAS	8	pp512	101.94 ± 0.59
Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL	737.07 MiB	1.33 B	BLAS	8	pp512	94.28 ± 0.18
Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS	700.32 MiB	1.33 B	BLAS	8	pp512	99.43 ± 0.87
Granite-3.1-1B-A400M-Instruct-BE-F32	4.97 GiB	1.33 B	BLAS	8	tg128	14.27 ± 0.29
Granite-3.1-1B-A400M-Instruct-BE-F16	2.49 GiB	1.33 B	BLAS	8	tg128	13.97 ± 0.11
Granite-3.1-1B-A400M-Instruct-BE-Q4_0	731.07 MiB	1.33 B	BLAS	8	tg128	69.33 ± 1.41
Granite-3.1-1B-A400M-Instruct-BE-Q4_1	807.57 MiB	1.33 B	BLAS	8	tg128	65.97 ± 1.71
Granite-3.1-1B-A400M-Instruct-BE-Q8_0	1.32 GiB	1.33 B	BLAS	8	tg128	57.82 ± 0.60
Granite-3.1-1B-A400M-Instruct-BE-Q4_K	782.12 MiB	1.33 B	BLAS	8	tg128	72.14 ± 0.70
Granite-3.1-1B-A400M-Instruct-BE-Q5_K	910.37 MiB	1.33 B	BLAS	8	tg128	70.34 ± 0.69
Granite-3.1-1B-A400M-Instruct-BE-Q6_K	1.02 GiB	1.33 B	BLAS	8	tg128	63.45 ± 0.68
Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL	737.07 MiB	1.33 B	BLAS	8	tg128	60.09 ± 1.33
Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS	700.32 MiB	1.33 B	BLAS	8	tg128	66.48 ± 1.29
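
To make the two tables easier to compare, the sketch below turns a few before/after pairs into speedup ratios (throughput after divided by throughput before). The numbers are copied from the tables above; the names `bench_row`, `bench_rows`, `speedup`, and `print_speedups` are hypothetical and only serve this illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* One before/after pair copied from the tables above (tokens per second). */
struct bench_row {
    const char *quant;
    const char *test;
    double before_tps;
    double after_tps;
};

/* A subset of rows; the remaining rows can be added the same way. */
static const struct bench_row bench_rows[] = {
    { "F32",  "pp512", 16.66,  85.46 },
    { "F16",  "pp512", 16.30,  35.39 },
    { "Q4_0", "pp512", 23.31, 121.46 },
    { "Q6_K", "pp512", 12.62, 101.94 },
    { "F32",  "tg128",  8.20,  14.27 },
    { "Q4_K", "tg128", 14.89,  72.14 },
};

/* Speedup factor: how many times faster the SIMD build is. */
static double speedup(double before_tps, double after_tps) {
    return after_tps / before_tps;
}

/* Prints one line per row: quantization, test name, speedup factor. */
static void print_speedups(void) {
    size_t n = sizeof(bench_rows) / sizeof(bench_rows[0]);
    for (size_t i = 0; i < n; ++i) {
        printf("%-5s %s %.2fx\n",
               bench_rows[i].quant,
               bench_rows[i].test,
               speedup(bench_rows[i].before_tps, bench_rows[i].after_tps));
    }
}
```

Calling `print_speedups()` from a `main` function prints the ratio for each listed row. The ratios ignore the reported ± spread, so they are best read as rough indicators.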

[!NOTE] Tests were conducted on an IBM z15 Mainframe with 8 IFLs (cores) and 64 GB Memory on an LPAR.

Please review this pull request and consider merging into the main repository. Thank you!

Feb 22 '25 08:02 taronaeo

I have fixed all the reported problems and re-tested the implementation to confirm that it works as intended. No problems so far; please let me know how I should proceed with this PR.

Feb 22 '25 09:02 taronaeo

It appears that these failing unit tests point towards not being able to download a model from HuggingFace. In run number 2, the server unit test threw the following error, which shows that the test could not download a model:

Run number 2 error details

==================================== ERRORS ====================================
________________ ERROR at setup of test_with_and_without_draft _________________

    @pytest.fixture(scope="module", autouse=True)
    def fixture_create_server():
>       return create_server()

unit/test_speculative.py:21: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
unit/test_speculative.py:14: in create_server
    server.model_draft = download_file(MODEL_DRAFT_FILE_URL)
utils.py:410: in download_file
    wget.download(url, out=output_file)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/wget.py:526: in download
    (tmpfile, headers) = ulib.urlretrieve(binurl, tmpfile, callback)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:241: in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:216: in urlopen
    return opener.open(url, data, timeout)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:525: in open
    response = meth(req, response)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:634: in http_response
    response = self.parent.error(
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:563: in error
    return self._call_chain(*args)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:496: in _call_chain
    result = func(*args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <urllib.request.HTTPDefaultErrorHandler object at 0x7f861c9f0f50>
req = <urllib.request.Request object at 0x7f861ca31ed0>
fp = <http.client.HTTPResponse object at 0x7f861ca2ab90>, code = 504
msg = 'Gateway Time-out'
hdrs = <http.client.HTTPMessage object at 0x7f861ca32c10>

    def http_error_default(self, req, fp, code, msg, hdrs):
>       raise HTTPError(req.full_url, code, msg, hdrs, fp)
E       urllib.error.HTTPError: HTTP Error 504: Gateway Time-out

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:643: HTTPError
---------------------------- Captured stdout setup -----------------------------
Downloading https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf to ./tmp/stories15M-q4_0.gguf
=========================== short test summary info ============================
ERROR unit/test_speculative.py::test_with_and_without_draft - urllib.error.HTTPError: HTTP Error 504: Gateway Time-out
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
====== 122 passed, 3 skipped, 108 deselected, 1 error in 94.14s (0:01:34) ======
Error: Process completed with exit code 1.

In run number 3, the server-windows unit test threw the following error, which appears to be the same problem: it was unable to download a model for testing.

Run number 3 error details

  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
100  3035  100  3035    0     0    300      0  0:00:10  0:00:10 --:--:--   744

0.10.309.401 E common_download_file: invalid http status code received: 504

0.10.314.217 E common_iniWaiting for server to start...
-------------------------- Captured stdout teardown ---------------------------
Stopping server with pid=6332
=========================== short test summary info ===========================
FAILED unit/test_basic.py::test_server_start_simple - TimeoutError: Server did not start within 12 seconds
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!
===================== 1 failed, 108 deselected in 12.99s ======================
Error: Process completed with exit code 1.

In both runs, the HuggingFace server returned a 504 status code. I believe this is unrelated to my code, unless I am missing something.

Please let me know how this PR can proceed given these sporadic unit test failures.

Feb 22 '25 15:02 taronaeo

> It appears that these failing unit tests point towards not being able to download a model from HuggingFace.

Yes, these runs fail from time to time for some reason - not related to this PR.

Feb 22 '25 15:02 ggerganov