WebGPU Support
GPU acceleration of transformers is possible, but it is hacky.
It requires an unmerged-PR version of transformers.js that relies on a patched version of onnxruntime-node.
Xenova plans to merge this PR only after onnxruntime officially supports GPU acceleration. In the meantime, the change could be implemented here, potentially as an advanced "experimental" feature.
This is really cool and paves the way for LLMs running in the browser!
I've had this idea in my head for a while now: we already have a kind of (primitive) vector DB (just a JSON) and a small model for embeddings. If we added an LLM for Q&A / text generation based solely on the information in the text, this would be huge!
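Just to illustrate what I mean by a primitive JSON vector DB (a minimal sketch with made-up names and values, not the actual SemanticFinder code):

```js
// Hypothetical store: each entry pairs a text chunk with its embedding vector.
const vectorStore = [
  { chunk: "In the beginning...", embedding: [0.01, -0.12 /* , ... */] },
  // ...
];

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank chunks by similarity to a query embedding; the top hits would then be
// passed to the LLM as context for Q&A / text generation.
function topK(queryEmbedding, k = 5) {
  return vectorStore
    .map(({ chunk, embedding }) => ({ chunk, score: cosineSimilarity(queryEmbedding, embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```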
I already asked the folks from Qdrant on their Discord server whether they'd be interested in providing a JS/WebAssembly version of their Rust-based vector DB (as they have developed plenty of optimizations), but for the moment they have other priorities. Still, they said they might go for it at some point.
Anyway, I think this would make for an interesting POC to explore. As for integrating it directly before it's officially supported: we could maybe detect WebGPU support automatically and simply load the right version? Or does the WebGPU version also support CPU?
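Detection itself should be straightforward, since browsers expose navigator.gpu. A rough sketch of the idea (the module paths are placeholders, not real files in this repo):

```js
// Feature-detect WebGPU; requestAdapter() resolves to null if no suitable GPU exists.
async function hasWebGPU() {
  if (!("gpu" in navigator)) return false;
  try {
    const adapter = await navigator.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}

// Hypothetical loader: pick the GPU-enabled build if available, otherwise the CPU/wasm one.
const transformers = (await hasWebGPU())
  ? await import("./transformers-webgpu.js")
  : await import("./transformers-cpu.js");
```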
P.S. There would be so much fun to be had with NLP and LLMs if, for example, we created an image of all leitmotifs in the text, or some kind of text-summary image, or similar, for a visual understanding of the text...
I am working on a similar effort myself, let's cooperate!
More specifically, I wanted to use this project as a basis for an SDK that allows one to run semantic search on their own website's content.
Sounds great!
It's also on the feature/idea list in the README.md that this repo could become a browser plugin for Firefox or Chrome. Of course, it would need a leaner GUI.
I was thinking of some kind of search bar integrated at the top of a webpage, like Algolia / lunr etc. do. A good example is the MkDocs Material homepage.
(By the way, I also had ideas for integrating semantic search in mkdocs, but I'm lacking the time atm...)
What about your idea?
(We're kind of drifting away from this issue's topic, let's move to discussions: https://github.com/do-me/SemanticFinder/discussions/15)
We're finally getting closer to WebGPU support (https://github.com/xenova/transformers.js/pull/545), and it's already usable in the dev branch. I'm really excited about this, as people are reporting speedups of 20x-100x!
In my case (M3 Max) I'm getting a massive inferencing speedup of 32x-46x. See for yourself: https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark
Even with cheap integrated graphics (Intel GPUs like UHD or Iris Xe) I get a 4x-8x boost. So practically everyone would see massive speed gains!
This is the most notable performance improvement I see atm, hence referencing #49.
I hope that transformers.js will allow for some kind of automatic setting where WebGPU is used if available but else falls back to plain CPU.
Speedup is about 10x for me on an M1. Definitely huge. Not sure how embeddings will compare to inference in terms of GPU optimization but I think there is huge room for parallelization.
Transformers.js and WebGPU
Folks, it's finally here 🥹 https://huggingface.co/posts/Xenova/681836693682285
However, afaik there are no docs for v3 yet. I tried updating SemanticFinder with v3 and running some quick tests, but failed. Here's what I tried:
- `npm uninstall @xenova/transformers`, then `npm install @huggingface/transformers`
- Replace the import statements in `semantic.js` and `worker.js` with `import { stuff } from '@huggingface/transformers';`
- Set a WebGPU-compatible model (not sure whether all are compatible by default?) like `<option selected value="Xenova/all-MiniLM-L12-v2">Xenova/all-MiniLM-L12-v2 | 💾133MB | 66.7MB | 34MB 📥2 ❤️3</option>` in `index.html`
- Change the extractor pipeline and use it e.g. like this:
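Roughly something like the following (a minimal sketch, the exact options may differ):

```js
import { pipeline } from '@huggingface/transformers';

// v3 alpha: select the WebGPU backend via the `device` option.
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L12-v2', {
  device: 'webgpu',
});
```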
Unfortunately it still throws some errors, so I'd say it's better to wait for the official v3 docs. Also, it's in alpha at the moment, so errors are pretty much expected.
Exciting news!
@do-me I think you also have to change the `quantized: true` flag to `dtype: "fp32"` for unquantized, or `dtype: "fp16"`, `"q8"` etc. for quantized. For example:
```js
// assumes `pipeline` is imported from the transformers.js build being used
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  device: 'webgpu',
  dtype: 'fp32', // or 'fp16'
});
```
Examples: https://github.com/xenova/transformers.js/issues/784 and https://github.com/xenova/transformers.js/issues/894
@gdmcdonald thanks for the references. Note that these examples use the "old" package @xenova/transformers, while here I tried the new @huggingface/transformers.
On my screenshot above you can see that dtype was set automatically, so apparently that's not the problem.
Rather, the problem seems to stem from `worker.js:94 An error occurred during model execution: "Error: Session already started".`, an error I don't understand, as only one session is created in the code.
We built the core embedding logic tightly around the old version of transformers.js, with callbacks etc., so I guess there is some compatibility problem with the new logic, or simply a bug in @huggingface/transformers.
When I manage to find some time, I will try again with the v3 branch of @xenova/transformers. If someone else wants to give it a try and create a PR, helping hands are always welcome :)
Ah ok. I was using @huggingface/transformers v3 as well and ran into the same issue you did: `worker.js:94 An error occurred during model execution: "Error: Session already started".` I just assumed I had too many WebGPU tabs open. Apologies for the spam!
Found a bug with WebGPU (wasm works fine): https://github.com/xenova/transformers.js/issues/909
The problem is calling the extractor twice in a row: the first call works (for the query embedding), but the second one fails (for the chunk embeddings).
Folks, it's here! 🥳 I added WebGPU support in the new branch and it's fast!
There was a simple problem in the old code where I called Promise.all() for parallel execution, which was nonsense; more details here: https://github.com/xenova/transformers.js/issues/909
I needed to modify this code in https://github.com/do-me/SemanticFinder/commit/f148689e47c4c4c76061d6a538b8e46baa87ab5d. Main changes were in index.js.
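The gist of the fix, as a simplified sketch (not the literal index.js code; `extractor` and `chunks` stand in for the real variables):

```js
// Before: all extractor calls were fired at once, so overlapping runs hit the single
// WebGPU inference session and triggered "Session already started":
// const embeddings = await Promise.all(chunks.map((chunk) => extractor(chunk)));

// After: run the calls sequentially so only one inference is active at a time.
const embeddings = [];
for (const chunk of chunks) {
  embeddings.push(await extractor(chunk));
}
```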
It's really fast! On my system it indexes the whole Bible in about 3 minutes with a small model like Xenova/all-MiniLM-L6-v2, whereas with wasm it used to take 30-40 minutes.
Not all models are supported, so we should go down that rabbit hole and see whether we can somehow filter the models in index.html for the webgpu branch. Also, the newer @huggingface/transformers versions starting with v.0.10 have some kind of bug, so I needed to hardcode version 0.9.
I was trying to set up a GitHub Action for the new webgpu branch so it would build the WebGPU version and push it to gh-pages in a /webgpu dir, but there were errors I couldn't follow up on so far. It somehow overwrote the files in the main directory and did not create the /webgpu dir. You can see my old attempts in the history. If someone wants to lend a hand, it would be highly appreciated :)
Anyway, I'm really excited about this change!
Fantastic news! Just played around with it and it's working well on my M1. Will follow up to see if I can help with the errors.
Finally managed to come up with the correct GitHub Action.
- You can now find the WebGPU app here: https://do-me.github.io/SemanticFinder/webgpu/
- Normal page uses wasm: https://do-me.github.io/SemanticFinder/
According to https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark/ you usually get even better speed-ups when processing in batches. At the moment, the naive logic in SemanticFinder just processes one single chunk at a time which might cause a major bottleneck. Will look into this.
@do-me can you tell me how to update the GitHub Action as well for my fork of SemanticFinder? ty
It's easy: you simply haven't cloned/checked out the webgpu branch yet (if I see correctly). If you add the branch to your repo, it will work.
> According to https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark/ you usually get even better speed-ups when processing in batches. At the moment, the naive logic in SemanticFinder just processes one single chunk at a time which might cause a major bottleneck. Will look into this.
Batch size changes everything. It gives me insane speedups of more than 20x!
I created a small app based on one of the first versions of SemanticFinder for testing the batch size. In my tests, a batch size of around 180 chunks per extractor() (inference) call gives the best results.
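For illustration, the batching boils down to something like this (a simplified sketch; `extractor` and `chunks` are placeholders, and 180 is just the value that worked best for me):

```js
const BATCH_SIZE = 180; // empirically found sweet spot, likely differs per GPU/model

const embeddings = [];
for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
  const batch = chunks.slice(i, i + BATCH_SIZE);
  // Passing an array lets transformers.js embed the whole batch in one inference call.
  const output = await extractor(batch, { pooling: 'mean', normalize: true });
  embeddings.push(...output.tolist());
}
```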
Play with it here: https://geo.rocks/semanticfinder-webgpu/. See also: https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark/discussions/103
The current logic in SemanticFinder is more complex than this minimal app, so it takes more time to update everything. Could use a hand here as I probably won't find time until next week.
Will look into adding it if I get a chance this week.
As WebGPU has been supported for quite a while now at https://do-me.github.io/SemanticFinder/webgpu/, I'll close this issue. My considerations about batch processing are probably best continued in https://github.com/do-me/SemanticFinder/issues/49