mattepiu
Correct me if I'm wrong, but current multimodal open-source models are essentially just a usual LLM plus the ability to accept images as input (see the sketch below). If so, keeping in mind it should be modular...
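That's roughly how most open multimodal checkpoints (LLaVA-style) are wired: a vision encoder, a small projector, and the LLM itself left unchanged. A minimal PyTorch sketch of that modular split; all sizes and the toy one-layer "encoder" here are made up purely for illustration:

```python
import torch
import torch.nn as nn

class TinyMultimodalLM(nn.Module):
    """Toy illustration: image patches -> vision encoder -> projector ->
    "soft tokens" concatenated with text embeddings, then the language
    model runs exactly as it would for text-only input."""

    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Stand-in vision encoder: one linear layer over flattened 16x16 patches.
        self.vision_encoder = nn.Linear(3 * 16 * 16, d_model)
        # Projector maps vision features into the LLM embedding space.
        self.projector = nn.Linear(d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, image_patches=None):
        h = self.tok_emb(token_ids)                      # (B, T, D)
        if image_patches is not None:                    # (B, P, 3*16*16)
            v = self.projector(self.vision_encoder(image_patches))
            h = torch.cat([v, h], dim=1)                 # prepend image "tokens"
        return self.lm_head(self.llm(h))

model = TinyMultimodalLM()
ids = torch.randint(0, 32000, (1, 8))
patches = torch.randn(1, 16, 3 * 16 * 16)
logits = model(ids, patches)
print(logits.shape)  # torch.Size([1, 24, 32000]): 16 image + 8 text positions
```

The point being: the vision side is a bolt-on module, which supports the "keep it modular" argument.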
This is what GPT-5 reported from a Python-code inspection of the KDA commit in the flash-linear-attention repo (maybe useful), along with the kernel it produced when asked (likely bugged): https://chatgpt.com/share/69088b9d-7260-800f-abe6-e0efc26baf4d
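For reference, the recurrence that DeltaNet-style kernels implement (KDA is a gated variant of this family) is, per timestep: S_t = S_{t-1}(I - beta_t * k_t k_t^T) + beta_t * v_t k_t^T, with output o_t = S_t q_t. A naive, unfused PyTorch reference of that plain delta rule, useful only as a correctness oracle when checking a generated kernel; shapes and the scalar per-token beta are my assumptions, and KDA's extra gating is not shown:

```python
import torch

def delta_rule_naive(q, k, v, beta):
    """Naive sequential delta-rule linear attention (DeltaNet-style).

    q, k, v: (B, T, D) tensors; beta: (B, T) per-token learning rates.
    Returns (B, T, D). O(T * D^2) loop, for testing fused kernels against.
    """
    B, T, D = q.shape
    S = q.new_zeros(B, D, D)              # running state S_t
    out = torch.empty_like(v)
    for t in range(T):
        k_t, v_t = k[:, t], v[:, t]       # (B, D) each
        b_t = beta[:, t, None]            # (B, 1)
        # Delta rule: overwrite the value currently stored under key k_t.
        v_old = torch.einsum('bij,bj->bi', S, k_t)        # S_{t-1} k_t
        S = S + torch.einsum('bi,bj->bij', b_t * (v_t - v_old), k_t)
        out[:, t] = torch.einsum('bij,bj->bi', S, q[:, t])
    return out

q = torch.randn(2, 5, 8); k = torch.randn(2, 5, 8)
v = torch.randn(2, 5, 8); beta = torch.rand(2, 5)
print(delta_rule_naive(q, k, v, beta).shape)  # torch.Size([2, 5, 8])
```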
You might want to see this gist on how to set up the torch device using torch_xla (CUDA/TPU/GPU): https://gist.github.com/ronaldseoh/da4afaa1bb9eb34d32d167ba417a5199 Once the right torch device is obtained, it should be 1:1...
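The core of that pattern is just a fallback chain; a minimal sketch (check the gist for the full version, this is not copied from it):

```python
import torch

def get_torch_device():
    """Pick the best available device: XLA (TPU) if torch_xla is
    installed, otherwise CUDA, otherwise CPU."""
    try:
        import torch_xla.core.xla_model as xm  # only present on XLA builds
        return xm.xla_device()
    except ImportError:
        pass
    if torch.cuda.is_available():
        return torch.device('cuda')
    return torch.device('cpu')

device = get_torch_device()
x = torch.ones(2, 2, device=device)  # tensors land on whichever device was found
print(device, x.device)
```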
Checking https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp, there's some C++ kernel code which might help more than the Python: https://github.com/deepseek-ai/DeepGEMM/pull/200/commits/7c95b14aa4a66edd7b682e5acdde62351ca81197 and https://github.com/deepseek-ai/FlashMLA/pull/98/commits/c28eca99dbc664dd2716415ed03492afe5fefade
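For orientation: as I understand it, what those kernels accelerate in DeepSeek-V3.2-Exp is sparse attention with per-query top-k token selection (a cheap indexer scores past tokens, and attention runs only over the selected ones). A rough, unoptimized PyTorch sketch of that selection idea; function names and shapes here are my own for illustration, not the kernels' API:

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, idx_scores, top_k):
    """Toy top-k sparse attention for a single query position.

    idx_scores: (T,) cheap indexer scores of each past token for this query;
    full attention is computed only over the top_k highest-scoring tokens.
    q: (D,), k: (T, D), v: (T, D).
    """
    T, D = k.shape
    sel = idx_scores.topk(min(top_k, T)).indices  # tokens to keep
    att = (k[sel] @ q) / D**0.5                   # scores on kept tokens only
    w = F.softmax(att, dim=-1)
    return w @ v[sel]                             # (D,) weighted value sum

T, D = 128, 16
q = torch.randn(D); k = torch.randn(T, D); v = torch.randn(T, D)
idx_scores = torch.randn(T)                       # stand-in for indexer output
print(topk_sparse_attention(q, k, v, idx_scores, top_k=32).shape)  # torch.Size([16])
```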
Yes, it seems context-oriented. Tested with voice cloning on "vi fanno paura i suoi occhi chiusi che non vi possono più vedere quelle sue mani dure gelide che non..." ("his closed eyes, which can no longer see you, frighten you; those hard, icy hands of his, which no longer...").