
ggml-cpu: Support s390x SIMD Instruction Set

Open taronaeo opened this issue 11 months ago • 1 comments

This pull request integrates the s390x SIMD instruction set into llama.cpp via vecintrin.h. SIMD implementations are currently provided for the following ggml_vec_dot functions:

| Function | Implementation | Remarks |
| --- | --- | --- |
| ggml_vec_dot_f32 | IMPLEMENTED | Noticed a hotspot on the assembly vector load call. Will fix in another PR. |
| ggml_vec_dot_f16 | IMPLEMENTED | Noticed a hotspot on the assembly vector load call. Will fix in another PR. |
| ggml_vec_dot_q4_0_q8_0 | IMPLEMENTED | |
| ggml_vec_dot_q4_1_q8_1 | IMPLEMENTED | |
| ggml_vec_dot_q8_0_q8_0 | IMPLEMENTED | |
| ggml_vec_dot_q4_K_q8_K | IMPLEMENTED | |
| ggml_vec_dot_q5_K_q8_K | IMPLEMENTED | |
| ggml_vec_dot_q6_K_q8_K | IMPLEMENTED | |
| ggml_vec_dot_iq4_nl_q8_0 | IMPLEMENTED | |
| ggml_vec_dot_iq4_xs_q8_K | IMPLEMENTED | |
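
For readers unfamiliar with what these functions compute, here is a minimal scalar sketch of a Q8_0 × Q8_0 dot product. It is not the code from this PR and not ggml's C implementation: `BlockQ8_0` and `quantize_q8_0` are hypothetical helper names, the scale is a plain Python float rather than fp16, and rounding is simplified. It only relies on the Q8_0 layout of 32 signed 8-bit values plus one scale per block. The integer inner loop is the kind of work a SIMD implementation would process several lanes at a time.

```python
from dataclasses import dataclass
from typing import List

QK8_0 = 32  # number of values per Q8_0 block


@dataclass
class BlockQ8_0:
    """Illustrative stand-in for a Q8_0 block (hypothetical Python type)."""
    d: float        # per-block scale (fp16 in ggml, plain float here)
    qs: List[int]   # QK8_0 signed 8-bit quantized values


def quantize_q8_0(x: List[float]) -> List[BlockQ8_0]:
    """Simplified quantization: scale each block so max |x| maps to 127."""
    assert len(x) % QK8_0 == 0, "length must be a multiple of the block size"
    blocks = []
    for i in range(0, len(x), QK8_0):
        chunk = x[i:i + QK8_0]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0
        inv = 1.0 / d if d else 0.0
        # Clamp to guard against floating-point overshoot past +/-127.
        qs = [max(-127, min(127, round(v * inv))) for v in chunk]
        blocks.append(BlockQ8_0(d, qs))
    return blocks


def vec_dot_q8_0_q8_0(x: List[BlockQ8_0], y: List[BlockQ8_0]) -> float:
    """Scalar reference: sum over blocks of d_x * d_y * sum(q_x * q_y)."""
    total = 0.0
    for bx, by in zip(x, y):
        # Integer multiply-accumulate over the 32 quants of one block.
        isum = sum(a * b for a, b in zip(bx.qs, by.qs))
        total += bx.d * by.d * isum
    return total
```

Quantizing two float vectors with `quantize_q8_0` and passing the results to `vec_dot_q8_0_q8_0` gives an approximation of their exact dot product, with an error bounded by the per-block quantization step.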

Verification

To check that this implementation does not break anything, the SIMD code paths were tested with the following models:

  • Tested IBM Granite 3.0 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
  • Tested IBM Granite 3.1 (F32, F16, Q4_0, Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS)
  • Kindly request additional models for testing in this PR

Performance Results

I use IBM Granite 3.1 for the performance results, as it has a better neural network than 3.0.

Before SIMD Instruction Set

| model | size | parameters | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | pp512 | 16.66 ± 0.01 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | pp512 | 16.30 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | pp512 | 23.31 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | pp512 | 26.52 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | pp512 | 29.73 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | pp512 | 23.91 ± 0.05 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | pp512 | 16.73 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | pp512 | 12.62 ± 0.01 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | pp512 | 23.88 ± 0.04 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | pp512 | 21.59 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | tg128 | 8.20 ± 0.07 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | tg128 | 9.70 ± 0.01 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | tg128 | 14.48 ± 0.03 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | tg128 | 15.95 ± 0.06 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | tg128 | 19.80 ± 0.04 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | tg128 | 14.89 ± 0.06 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | tg128 | 10.94 ± 0.04 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | tg128 | 8.53 ± 0.02 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | tg128 | 14.38 ± 0.07 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | tg128 | 13.22 ± 0.02 |

After SIMD Instruction Set

| model | size | parameters | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | pp512 | 85.46 ± 0.09 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | pp512 | 35.39 ± 0.13 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | pp512 | 121.46 ± 0.81 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | pp512 | 123.79 ± 0.40 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | pp512 | 137.36 ± 0.52 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | pp512 | 118.88 ± 0.56 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | pp512 | 111.65 ± 0.38 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | pp512 | 101.94 ± 0.59 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | pp512 | 94.28 ± 0.18 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | pp512 | 99.43 ± 0.87 |
| Granite-3.1-1B-A400M-Instruct-BE-F32 | 4.97 GiB | 1.33 B | BLAS | 8 | tg128 | 14.27 ± 0.29 |
| Granite-3.1-1B-A400M-Instruct-BE-F16 | 2.49 GiB | 1.33 B | BLAS | 8 | tg128 | 13.97 ± 0.11 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_0 | 731.07 MiB | 1.33 B | BLAS | 8 | tg128 | 69.33 ± 1.41 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_1 | 807.57 MiB | 1.33 B | BLAS | 8 | tg128 | 65.97 ± 1.71 |
| Granite-3.1-1B-A400M-Instruct-BE-Q8_0 | 1.32 GiB | 1.33 B | BLAS | 8 | tg128 | 57.82 ± 0.60 |
| Granite-3.1-1B-A400M-Instruct-BE-Q4_K | 782.12 MiB | 1.33 B | BLAS | 8 | tg128 | 72.14 ± 0.70 |
| Granite-3.1-1B-A400M-Instruct-BE-Q5_K | 910.37 MiB | 1.33 B | BLAS | 8 | tg128 | 70.34 ± 0.69 |
| Granite-3.1-1B-A400M-Instruct-BE-Q6_K | 1.02 GiB | 1.33 B | BLAS | 8 | tg128 | 63.45 ± 0.68 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_NL | 737.07 MiB | 1.33 B | BLAS | 8 | tg128 | 60.09 ± 1.33 |
| Granite-3.1-1B-A400M-Instruct-BE-IQ4_XS | 700.32 MiB | 1.33 B | BLAS | 8 | tg128 | 66.48 ± 1.29 |

> [!NOTE]
> Tests were conducted on an IBM z15 Mainframe with 8 IFLs (cores) and 64 GB Memory on an LPAR.
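
To make the before/after comparison easier to read, the speedup per quantization type can be computed directly from the two tables above. The sketch below copies the mean t/s figures (ignoring the ± spread) into a dictionary; the names `RESULTS` and `speedups` are illustrative only and not part of llama.cpp or llama-bench.

```python
# Mean tokens/second copied from the tables above: (before SIMD, after SIMD).
RESULTS = {
    "pp512": {
        "F32": (16.66, 85.46),
        "F16": (16.30, 35.39),
        "Q4_0": (23.31, 121.46),
        "Q4_1": (26.52, 123.79),
        "Q8_0": (29.73, 137.36),
        "Q4_K": (23.91, 118.88),
        "Q5_K": (16.73, 111.65),
        "Q6_K": (12.62, 101.94),
        "IQ4_NL": (23.88, 94.28),
        "IQ4_XS": (21.59, 99.43),
    },
    "tg128": {
        "F32": (8.20, 14.27),
        "F16": (9.70, 13.97),
        "Q4_0": (14.48, 69.33),
        "Q4_1": (15.95, 65.97),
        "Q8_0": (19.80, 57.82),
        "Q4_K": (14.89, 72.14),
        "Q5_K": (10.94, 70.34),
        "Q6_K": (8.53, 63.45),
        "IQ4_NL": (14.38, 60.09),
        "IQ4_XS": (13.22, 66.48),
    },
}


def speedups(test: str) -> dict:
    """Return the after/before throughput ratio for each quantization type."""
    return {quant: after / before
            for quant, (before, after) in RESULTS[test].items()}


if __name__ == "__main__":
    for test in RESULTS:
        for quant, factor in speedups(test).items():
            print(f"{test} {quant:7s} {factor:5.2f}x")
```

On these figures, Q6_K shows the largest ratio in both tests (about 8x for pp512 and about 7.4x for tg128), while F16 shows the smallest (about 2.2x and 1.4x respectively).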

Please review this pull request and consider merging into the main repository. Thank you!

taronaeo avatar Feb 22 '25 08:02 taronaeo

I have fixed all problems and re-tested the implementation to confirm that it works as intended. No problems so far. Do let me know how I should proceed with this PR.

taronaeo avatar Feb 22 '25 09:02 taronaeo

It appears that these failing unit tests point towards not being able to download a model from HuggingFace. In run number 2, the server unit test threw the following error, which points directly to the test failing to download a model.

Run number 2 error details
==================================== ERRORS ====================================
________________ ERROR at setup of test_with_and_without_draft _________________

    @pytest.fixture(scope="module", autouse=True)
    def fixture_create_server():
>       return create_server()

unit/test_speculative.py:21: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
unit/test_speculative.py:14: in create_server
    server.model_draft = download_file(MODEL_DRAFT_FILE_URL)
utils.py:410: in download_file
    wget.download(url, out=output_file)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/wget.py:526: in download
    (tmpfile, headers) = ulib.urlretrieve(binurl, tmpfile, callback)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:241: in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:216: in urlopen
    return opener.open(url, data, timeout)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:525: in open
    response = meth(req, response)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:634: in http_response
    response = self.parent.error(
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:563: in error
    return self._call_chain(*args)
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:496: in _call_chain
    result = func(*args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <urllib.request.HTTPDefaultErrorHandler object at 0x7f861c9f0f50>
req = <urllib.request.Request object at 0x7f861ca31ed0>
fp = <http.client.HTTPResponse object at 0x7f861ca2ab90>, code = 504
msg = 'Gateway Time-out'
hdrs = <http.client.HTTPMessage object at 0x7f861ca32c10>

    def http_error_default(self, req, fp, code, msg, hdrs):
>       raise HTTPError(req.full_url, code, msg, hdrs, fp)
E       urllib.error.HTTPError: HTTP Error 504: Gateway Time-out

/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/urllib/request.py:643: HTTPError
---------------------------- Captured stdout setup -----------------------------
Downloading https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf to ./tmp/stories15M-q4_0.gguf
=========================== short test summary info ============================
ERROR unit/test_speculative.py::test_with_and_without_draft - urllib.error.HTTPError: HTTP Error 504: Gateway Time-out
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
====== 122 passed, 3 skipped, 108 deselected, 1 error in 94.14s (0:01:34) ======
Error: Process completed with exit code 1.

In run number 3, the server-windows unit test threw the following error. It appears to be the same problem: the test is unable to download a model.

Run number 3 error details
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
100  3035  100  3035    0     0    300      0  0:00:10  0:00:10 --:--:--   744

0.10.309.401 E common_download_file: invalid http status code received: 504

0.10.314.217 E common_iniWaiting for server to start...
-------------------------- Captured stdout teardown ---------------------------
Stopping server with pid=6332
=========================== short test summary info ===========================
FAILED unit/test_basic.py::test_server_start_simple - TimeoutError: Server did not start within 12 seconds
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!
===================== 1 failed, 108 deselected in 12.99s ======================
Error: Process completed with exit code 1.

In both cases, the HuggingFace server returned a 504 status code. I believe this is unrelated to my code, unless I am missing something here.

Do let me know how this PR can proceed with these sporadic errors occurring in the unit tests.
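
As an aside, transient gateway errors like the 504 above are commonly handled by retrying with exponential backoff. The sketch below is a hypothetical helper, not the test suite's `download_file` (which, per the traceback, calls `wget.download`) and not a proposed change in this PR. `fetch_with_retry`, its parameters, and the set of retried status codes are all assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request

# Status codes treated as transient in this sketch (an assumption).
TRANSIENT_CODES = {502, 503, 504}


def fetch_with_retry(url, attempts=4, base_delay=1.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, retrying transient HTTP errors with exponential backoff.

    opener and sleep are injectable so the logic can be exercised
    without network access or real waiting.
    """
    for attempt in range(attempts):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            is_last = attempt == attempts - 1
            if err.code not in TRANSIENT_CODES or is_last:
                raise  # permanent error, or retries exhausted
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

With the defaults, a download that hits two consecutive 504 responses would wait 1 s and then 2 s before succeeding on the third try; non-transient errors such as 404 are raised immediately.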

taronaeo avatar Feb 22 '25 15:02 taronaeo

> It appears that these failing unit tests point towards not being able to download a model from HuggingFace.

Yes, these runs fail from time to time for some reason - not related to this PR.

ggerganov avatar Feb 22 '25 15:02 ggerganov