
[Idea]: Use Android NNAPI to accelerate inference on Android Devices

Interpause opened this issue 3 years ago • 5 comments

This is just an idea for you. Most modern smartphones come with some form of AI accelerator. I am aware that GGML-based projects like llama.cpp can compile and run on mobile devices, but there is probably performance left on the table. I think there is currently a gap for a mobile-optimized AI inference library with quantization support and the other tricks present in GGML. For reference: https://developer.android.com/ndk/guides/neuralnetworks

Interpause avatar Apr 16 '23 13:04 Interpause

Would love to see this as well!

Saghetti0 avatar Nov 15 '23 03:11 Saghetti0

If there is community help, we can try to add support for NNAPI. Currently I don't have enough capacity to investigate this, but I think it is interesting and could unlock many applications. I will probably look into this in the future and hope there will be some contributions in the meantime.

ggerganov avatar Nov 19 '23 16:11 ggerganov

I'm trying to write an NNAPI backend (well, you shouldn't expect much from my work, because I'm a complete newbie and will most likely not have any success). But after some document reading, I found that unlike CL or VK, NNAPI doesn't provide a way to use accelerated matrix multiplies or shader-like primitives to compute things on the GPU. The only thing you can do with it is upload a graph describing how the layers are connected (including operands and weights). So it seems like it doesn't match the architecture llama.cpp currently has? If it does, please point me to a backend using a similar architecture so that I can use it as a reference.

rhjdvsgsgks avatar Nov 27 '23 17:11 rhjdvsgsgks
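To make the graph-upload model described above concrete, here is a minimal sketch of building a single-operation NNAPI graph (an element-wise ADD) with the NDK C API. This is only a sketch: it assumes the Android NDK headers (API level 27+), omits all error checking, and `build_add_model` is a hypothetical helper name, not anything from llama.cpp or ggml.

```c
#include <android/NeuralNetworks.h>  // Android NDK, API level 27+
#include <stddef.h>
#include <stdint.h>

// Build a one-operation NNAPI graph computing out = a + b for float32
// tensors of shape {4}. Error checking omitted for brevity.
static ANeuralNetworksModel* build_add_model(void) {
    ANeuralNetworksModel* model = NULL;
    ANeuralNetworksModel_create(&model);

    uint32_t dims[1] = {4};
    ANeuralNetworksOperandType tensor = {
        .type = ANEURALNETWORKS_TENSOR_FLOAT32,
        .dimensionCount = 1,
        .dimensions = dims,
        .scale = 0.0f,
        .zeroPoint = 0,
    };
    ANeuralNetworksOperandType scalar = {
        .type = ANEURALNETWORKS_INT32,
        .dimensionCount = 0,
        .dimensions = NULL,
        .scale = 0.0f,
        .zeroPoint = 0,
    };

    // Operand indices: 0 = a, 1 = b, 2 = fuse code, 3 = out.
    ANeuralNetworksModel_addOperand(model, &tensor);  // 0
    ANeuralNetworksModel_addOperand(model, &tensor);  // 1
    ANeuralNetworksModel_addOperand(model, &scalar);  // 2
    ANeuralNetworksModel_addOperand(model, &tensor);  // 3

    // ADD takes a fused-activation operand; NONE means no activation.
    int32_t fuse = ANEURALNETWORKS_FUSED_NONE;
    ANeuralNetworksModel_setOperandValue(model, 2, &fuse, sizeof(fuse));

    uint32_t op_inputs[3] = {0, 1, 2};
    uint32_t op_outputs[1] = {3};
    ANeuralNetworksModel_addOperation(model, ANEURALNETWORKS_ADD,
                                      3, op_inputs, 1, op_outputs);

    // Declare which operands are fed/read by the caller, then freeze.
    uint32_t model_inputs[2] = {0, 1};
    ANeuralNetworksModel_identifyInputsAndOutputs(model, 2, model_inputs,
                                                  1, op_outputs);
    ANeuralNetworksModel_finish(model);
    return model;
}
```

This declare-operands, declare-operations, finish-and-compile flow is what the comment above means by "upload a graph": the driver sees the whole network at once and decides how to run it, rather than exposing individual matmul or shader primitives the way CL/VK backends do.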

@ggerganov maybe it's worth checking NNAPI via the ONNX runtime? WhisperRN runs smoothly with CoreML, but on Android even the tiny model is way too laggy to be usable on a budget device (for example a Samsung A14 with 4 GB RAM).

pax-k avatar Mar 27 '24 15:03 pax-k