
Feature Request: Add TPU/Hardware Accelerator Support (e.g., Google Coral, Hailo) to llama.cpp

FixeQD opened this issue 1 year ago

Prerequisites

  • [x] I am running the latest code. Mention the version if possible as well.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

I propose adding hardware acceleration support for AI-focused chips like TPUs (e.g., Google Coral) and Hailo to llama.cpp. This would allow users to leverage dedicated AI accelerators for faster inference of LLMs (e.g., LLaMA) on edge devices like Raspberry Pi or low-power setups.

Motivation

  • Current Limitation: llama.cpp relies heavily on CPU/GPU, which limits performance on resource-constrained devices.
  • TPUs and Hailo: These accelerators are designed for efficient tensor operations and could drastically reduce inference latency/power consumption.
  • Community Impact: Many developers use devices like Raspberry Pi with TPU/Hailo add-ons – this integration would unlock new use cases.

Possible Implementation

1. Google Coral (Edge TPU) Integration

  • Libraries: Use libedgetpu (GitHub), Google's open-source library for interacting with Edge TPUs.
  • Model Conversion:
    • Convert GGUF/GGML models to TensorFlow Lite format using existing tools in llama.cpp.
    • Compile TFLite models for TPU compatibility using the edgetpu_compiler tool.
  • Inference Workflow:
    • Offload matrix operations (e.g., tensor contractions) to the TPU via libedgetpu APIs.
    • Implement TPU-specific quantization (e.g., int8) to maximize performance.
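
For the Coral path, a minimal sketch of the libedgetpu + TensorFlow Lite C++ usage pattern (following the Coral docs at https://coral.ai/docs/edgetpu/tflite-cpp/) is shown below. This is not existing llama.cpp code: the model path is a placeholder for a network already compiled with edgetpu_compiler, and a real integration would have to feed llama.cpp's tensors through such an interpreter rather than run a standalone program.

```cpp
// Sketch only: standard libedgetpu + TFLite C++ setup, not llama.cpp code.
#include <memory>
#include <string>

#include "edgetpu.h"                           // from libedgetpu
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
    // Placeholder: a model already compiled for the Edge TPU with edgetpu_compiler.
    const std::string model_path = "llama_block_edgetpu.tflite";

    // The FlatBufferModel must outlive the interpreter that references its buffers.
    auto model = tflite::FlatBufferModel::BuildFromFile(model_path.c_str());

    // Open the first available Edge TPU device.
    std::shared_ptr<edgetpu::EdgeTpuContext> tpu_ctx =
        edgetpu::EdgeTpuManager::GetSingleton()->OpenDevice();

    // Register the Edge TPU custom op so compiled subgraphs are dispatched to the TPU.
    tflite::ops::builtin::BuiltinOpResolver resolver;
    resolver.AddCustom(edgetpu::kCustomOp, edgetpu::RegisterCustomOp());

    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);

    // Bind the Edge TPU context and allocate tensors before running inference.
    interpreter->SetExternalContext(kTfLiteEdgeTpuContext, tpu_ctx.get());
    interpreter->SetNumThreads(1);
    interpreter->AllocateTensors();

    // ... fill input tensors here ...
    interpreter->Invoke();
    return 0;
}
```

The hard part is not this boilerplate but the conversion step above it: the Edge TPU only runs int8 TFLite ops compiled ahead of time, and anything the compiler cannot map falls back to the CPU, so the transformer's large matmuls would have to survive that export to see any benefit.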

2. Hailo Integration

  • Libraries: Leverage hailort (GitHub), Hailo's runtime library for deploying models on Hailo accelerators.
  • Model Conversion:
    • Convert models to Hailo's native HEF format using the Hailo Dataflow Compiler.
    • Use intermediate formats like ONNX for compatibility with Hailo's tools.
  • Inference Workflow:
    • Load HEF models via hailort and manage inference pipelines for low-latency execution.
    • Optimize model layers using Hailo's profiling tools to balance compute between CPU and Hailo.
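
For the Hailo path, a minimal sketch of loading and configuring a pre-compiled HEF through the HailoRT C++ API is shown below, to the best of my understanding of that API. The model.hef filename is a placeholder for a network produced by the Dataflow Compiler, and the sketch deliberately stops before the inference pipeline (vstreams or infer-model bindings), which is where the actual llama.cpp integration work would sit.

```cpp
// Sketch only: HailoRT C++ device/HEF setup, not llama.cpp code.
#include <iostream>

#include "hailo/hailort.hpp"

int main() {
    // Acquire a virtual device handle for the attached Hailo accelerator.
    auto vdevice = hailort::VDevice::create();
    if (!vdevice) {
        std::cerr << "failed to create VDevice, status = " << vdevice.status() << "\n";
        return 1;
    }

    // Placeholder: a network pre-compiled to HEF by the Hailo Dataflow Compiler
    // (typically exported via ONNX as mentioned above).
    auto hef = hailort::Hef::create("model.hef");
    if (!hef) {
        std::cerr << "failed to load HEF, status = " << hef.status() << "\n";
        return 1;
    }

    // Configure the network group(s) contained in the HEF on the device.
    auto network_groups = vdevice.value()->configure(hef.value());
    if (!network_groups) {
        std::cerr << "failed to configure HEF, status = " << network_groups.status() << "\n";
        return 1;
    }

    // A real integration would now create input/output vstreams and stream
    // activations through the device, falling back to the CPU for anything
    // the HEF does not cover.
    std::cout << "configured " << network_groups.value().size() << " network group(s)\n";
    return 0;
}
```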

3. Unified Hardware Abstraction

  • Design a modular backend system in llama.cpp to support multiple accelerators (TPU, Hailo, GPU).
  • Add configuration flags (e.g., --tpu, --hailo) to let users select the accelerator at runtime.
  • Provide clear error handling for unsupported operations (e.g., fallback to CPU).
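
As a rough illustration of the abstraction idea, a backend interface with CPU fallback could look something like the sketch below. All names here (accel_backend, op_desc, dispatch) are hypothetical; in practice this would map onto ggml's existing backend interface rather than introduce a new layer on top of it.

```cpp
// Hypothetical sketch of a pluggable accelerator interface with CPU fallback.
#include <memory>
#include <string>
#include <vector>

struct op_desc {
    std::string name;   // e.g. "matmul", "softmax"; shapes/dtypes omitted here
};

class accel_backend {
public:
    virtual ~accel_backend() = default;
    virtual std::string name() const = 0;
    // Can this accelerator execute the op at all?
    virtual bool supports_op(const op_desc &op) const = 0;
    // Execute the op; return false on failure so the caller can fall back.
    virtual bool compute(const op_desc &op) = 0;
};

class cpu_backend : public accel_backend {
public:
    std::string name() const override { return "cpu"; }
    bool supports_op(const op_desc &) const override { return true; }
    bool compute(const op_desc &) override { return true; }   // reference path
};

// Backends are ordered by priority (e.g. selected via --tpu / --hailo flags),
// with the CPU backend last as the universal fallback.
bool dispatch(const op_desc &op,
              const std::vector<std::unique_ptr<accel_backend>> &backends) {
    for (const auto &b : backends) {
        if (b->supports_op(op) && b->compute(op)) {
            return true;
        }
    }
    return false;   // not even the CPU backend handled it
}
```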

4. Cross-Platform Support

  • Raspberry Pi: Document driver installation and library dependencies for both Coral TPU and Hailo.
  • Quantization Tools: Extend llama.cpp's quantization scripts to generate accelerator-optimized models (e.g., TPU-int8, Hailo-16bit).
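
To illustrate the quantization point, a symmetric per-tensor int8 quantizer of the kind these toolchains expect might look like the sketch below. This is illustrative only: llama.cpp's own formats (e.g. Q8_0) are block-wise with per-block scales, and both the Edge TPU and Hailo toolchains perform their own quantization at model-compilation time.

```cpp
// Illustrative symmetric per-tensor int8 quantization, not a llama.cpp format.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantizes x so that x[i] ~= scale * q[i], with q[i] in [-127, 127].
// Returns the scale needed to dequantize.
float quantize_int8(const std::vector<float> &x, std::vector<int8_t> &q) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    const float scale = (amax > 0.0f) ? amax / 127.0f : 1.0f;

    q.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        const int v = static_cast<int>(std::lround(x[i] / scale));
        q[i] = static_cast<int8_t>(std::clamp(v, -127, 127));
    }
    return scale;
}
```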

Use Case Examples

  • Raspberry Pi + Hailo-8L: Local AI chatbot with real-time response.
  • Google Coral + LLaMA-7B: Energy-efficient inference for IoT devices.

Testing Availability

I will soon acquire the Raspberry Pi AI Kit with Hailo-8L and can act as a tester for the Hailo integration. I should be able to start testing within a few weeks. My setup will include a Raspberry Pi 5 with 8 GB (or even 16 GB) RAM, and I plan to test models like LLaMA and DeepSeek for tasks such as text generation and chatbot applications.

FixeQD avatar Feb 02 '25 20:02 FixeQD

Yes, it would be very nice to add support for TPU accelerators

https://coral.ai/docs/edgetpu/tflite-cpp/

fedecompa avatar Feb 17 '25 09:02 fedecompa

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 03 '25 01:04 github-actions[bot]

Bruh

FixeQD avatar Apr 03 '25 04:04 FixeQD

Was this done elsewhere? Why was this marked stale? Can we reopen this?

TheInfamousAlk avatar Jul 25 '25 01:07 TheInfamousAlk

> Was this done elsewhere? Why was this marked stale? Can we reopen this?

Idk why

FixeQD avatar Jul 25 '25 11:07 FixeQD

It's a bot that closes issues after inactivity. Generally, there is no point in keeping stale issues open. Closing an issue does not mean it's of no interest - it simply means there is no progress.

ggerganov avatar Jul 25 '25 11:07 ggerganov

Honestly, I doubt it's actually possible to run an LLM on the Hailo-8L. Its main purpose is computer vision projects using the Raspberry Pi camera, and its creators have written several times that it isn't intended for LLMs.

But who knows, maybe it's still somehow possible to load the model into system RAM and use it.

DGdev91 avatar Aug 07 '25 10:08 DGdev91

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Sep 21 '25 01:09 github-actions[bot]

Not stale

TheInfamousAlk avatar Sep 21 '25 12:09 TheInfamousAlk

Hi. Not only for the Raspberry Pi - they now target data centers with their PCIe version 😉 https://revinetech.com/product/73/hailo-8-century-high-performance-pcie-card

But they also say (in the Hailo forum):

... Our current Gen chips, Hailo8, are for computer vision based AI. Our next generation of chips, Hailo10H, will be for running LLMs ...

Cyrille37 avatar Nov 14 '25 18:11 Cyrille37

So nobody wants to try? https://github.com/ollama/ollama/issues/990

kappa8219 avatar Dec 15 '25 10:12 kappa8219

  1. Funny AI-slopped first post - if everything were that easy, it would have been done a long time ago.
  2. On a more serious note, the listed AI accelerators (Coral TPU and Hailo-8) are just not suitable for LLM acceleration. They were created for much smaller ConvNets and image processing. I think there were attempts to run Whisper on the Hailo-8, but I don't think it went anywhere.
  3. There are some accelerators that support running LLMs, but these are mostly built into the SoC - see the Rockchip RK3588 (with rknn-llm) and some others, for example. Other notable mentions include certain Qualcomm chips with Genie, AMD Ryzen AI Pro CPUs, and perhaps some others.
  4. From Hailo, as mentioned above, the Hailo-10H is intended for LLM inference. Maybe there will be some consumer hardware incorporating this module in the near future.

AIWintermuteAI avatar Dec 18 '25 14:12 AIWintermuteAI

Indeed.

libedgetpu - This repository was archived by the owner on Oct 14, 2025. It is now read-only.

I wonder why not try TPUs in Google Cloud. For now the TensorFlow test is core dumping, but it may be /dev/hands: https://www.reddit.com/r/googlecloud/comments/1pk82ln/comment/nu8b8ss/?context=1

kappa8219 avatar Dec 18 '25 16:12 kappa8219

Just mentioning the new M5Stack LLM-8850AI M.2 Acceleration Card (AX8850), which can run LLMs. See: How I Built an INSANELY Fast OFFLINE AI Chatbot (Pi5 + Whisplay + LLM8850)

stevef1uk avatar Jan 10 '26 22:01 stevef1uk

Raspberry Pi has now launched the Raspberry Pi AI HAT+ 2.

The specs are:

  • Hailo-10H AI accelerator delivering 40 TOPS (INT4) inferencing performance
  • Performance for computer vision models comparable to the Raspberry Pi AI HAT+ (26 TOPS)
  • Runs generative AI models efficiently using 8GB on-board RAM
  • Fully integrated into Raspberry Pi’s camera software stack

In the Hailo Model Zoo GenAI we can see some things that would be of interest to llama.cpp users.

cristianadam avatar Jan 15 '26 09:01 cristianadam

Worth noting this benchmark, which found that the RPi5 CPU is already faster than the HAT (and less memory-limited, if you have a >8GB Pi5): https://www.jeffgeerling.com/blog/2026/raspberry-pi-ai-hat-2/

lee-b avatar Jan 15 '26 10:01 lee-b