bark.cpp fix: Metal backend

This PR lets users run on the Metal (macOS) and cuBLAS backends by:

[x] Exposing the n_gpu_layers parameter in the CLI
[ ] Using the Metal backend in the forward pass
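
For reference, a minimal sketch of what exposing the parameter in the CLI could look like. The `-ngl` flag is the one used in the command line quoted in this thread; the `cli_params` struct and the `parse_gpu_layers` helper are hypothetical names chosen for illustration and are not taken from the bark.cpp sources.

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical parameter struct (illustration only). The field name mirrors
// the n_gpu_layers parameter mentioned in the PR description.
struct cli_params {
    int n_gpu_layers = 0;  // assumption: 0 means "keep everything on the CPU"
};

// Scans argv for "-ngl <N>" and stores N in params.n_gpu_layers.
// Returns false if the flag is present but its value is missing,
// not a whole number, negative, or too large for an int.
bool parse_gpu_layers(int argc, const char * const * argv, cli_params & params) {
    assert(argv != nullptr);
    for (int i = 1; i < argc; ++i) {
        const std::string arg = argv[i];
        if (arg != "-ngl") {
            continue;  // other flags are handled elsewhere
        }
        if (i + 1 >= argc) {
            return false;  // flag given without a value
        }
        char * end = nullptr;
        const long value = std::strtol(argv[i + 1], &end, 10);
        if (end == argv[i + 1] || *end != '\0' || value < 0 || value > INT_MAX) {
            return false;  // value is not a valid non-negative integer
        }
        params.n_gpu_layers = static_cast<int>(value);
        ++i;  // skip the value we just consumed
    }
    return true;
}
```

With the arguments `-ngl 100 -t 8`, this helper would set `n_gpu_layers` to 100 and ignore the `-t` flag. The parsed value still has to reach the backend initialisation for it to have any effect, which is the unchecked item above.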

Apr 16 '24 21:04 PABannier

After it creates the tokens and runs ggml_metal_init, I get this:

ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 21845.34 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =    54.36 MB, (   54.98 / 21845.34)
encodec_load_model_weights: model size =    44.36 MB
encodec_load_model: n_q = 32
ggml_metal_add_buffer: allocated 'backend         ' buffer, size =   314.06 MB, (  369.05 / 21845.34)
encodec_eval: compute buffer size: 314.05 MB

ggml_metal_graph_compute_block_invoke: error: node   0, op =   REPEAT not implemented
GGML_ASSERT: /Users/siraben/Git/bark.cpp/encodec.cpp/ggml/src/ggml-metal.m:1428: false
ggml_metal_graph_compute_block_invoke: error: node 4677, op = MAP_CUSTOM2_F32 not implemented
[1]    9701 abort      ./examples/main/main -ngl 100 -t 8 -m ./ggml_weights/ggml_weights.bin -em  -p

Apr 19 '24 17:04 siraben

Hello @siraben! Indeed, some operations (e.g., repeat, which is used to broadcast computations) do not yet have a corresponding Metal kernel in ggml, which is why the graph computation aborts with "not implemented". I'll open a PR to implement them.
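
To make the missing operation concrete, here is a rough CPU-side sketch of what a repeat does in the 2D case: the source is tiled until it fills the destination shape, which is how a small tensor gets broadcast against a larger one. This is a plain C++ illustration and not ggml's implementation; `repeat_2d` is a hypothetical name, and the row-major layout is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tiles a row-major src (src_rows x src_cols) into a row-major result of
// shape dst_rows x dst_cols, so that dst[r][c] = src[r % src_rows][c % src_cols].
// Each destination dimension must be a whole multiple of the source dimension.
std::vector<float> repeat_2d(const std::vector<float> & src,
                             std::size_t src_rows, std::size_t src_cols,
                             std::size_t dst_rows, std::size_t dst_cols) {
    assert(src_rows > 0 && src_cols > 0);
    assert(src.size() == src_rows * src_cols);
    assert(dst_rows % src_rows == 0 && dst_cols % src_cols == 0);

    std::vector<float> dst(dst_rows * dst_cols);
    for (std::size_t r = 0; r < dst_rows; ++r) {
        for (std::size_t c = 0; c < dst_cols; ++c) {
            // Wrap the destination index back into the source extent.
            dst[r * dst_cols + c] = src[(r % src_rows) * src_cols + (c % src_cols)];
        }
    }
    return dst;
}
```

For example, repeating a 1x3 bias row into a 2x3 shape copies the row twice, after which it can be added element-wise to a 2x3 activation. A Metal kernel for the operation would need to perform the same index wrapping on the GPU.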

Apr 20 '24 13:04 PABannier

When I try to run cmake -DGGML_CUBLAS=ON .. I get:

CMake Warning at encodec.cpp/ggml/src/CMakeLists.txt:219 (message):
  cuBLAS not found

Apr 23 '24 23:04 normatovjj

When I try to run cmake -DGGML_CUBLAS=ON .. I get:

CMake Warning at encodec.cpp/ggml/src/CMakeLists.txt:219 (message):
  cuBLAS not found

I also tried CMAKE_ARGS='-DLLAMA_CUBLAS=on' cmake .. and applied all the changes proposed in this pull request, but without success.

Apr 26 '24 02:04 normatovjj