Alex Cheema
The main thing I want to address and test is device support. We can make this the default inference engine if it works reliably across many devices. On that point,...
Hey @risingsunomi, I'm thinking of making this the default inference engine on Linux machines. Could you resolve the conflicts, please?
`torch` is not added as a dependency.
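For what it's worth, a minimal sketch of what declaring it could look like, assuming dependencies are managed via `setuptools` in `setup.py` (the extras name and version pins below are illustrative, not the project's actual layout):

```python
# setup.py (sketch) -- assumes exo declares dependencies via setuptools;
# the "torch" extras name and version pins are illustrative only.
from setuptools import setup, find_packages

setup(
    name="exo",
    packages=find_packages(),
    install_requires=[
        # ... existing dependencies ...
    ],
    extras_require={
        # Optional extra so the PyTorch engine pulls in its own dependencies.
        "torch": [
            "torch>=2.0",
            "accelerate",  # needed for device_map / low_cpu_mem_usage loading
        ],
    },
)
```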
```
error loading and splitting model: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`
Error processing prompt: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`
Traceback...
```
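For context, `transformers` raises this whenever `from_pretrained` is called with `device_map=...` or `low_cpu_mem_usage=True` but `accelerate` isn't installed. A rough sketch of the two ways around it (the model id and arguments are placeholders):

```python
from transformers import AutoModelForCausalLM

# Option 1: keep device_map / low_cpu_mem_usage, which requires `accelerate`
# to be installed (so it should be declared as a dependency).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    device_map="auto",
    low_cpu_mem_usage=True,
)

# Option 2: skip the accelerate-backed loading path and place the model
# manually -- viable when exo already decides which device a shard lives on.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
).to("cuda")
```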
It seems to use some other downloader (perhaps transformers?). It should use the exo downloader for integration with exo (also, these other downloads aren't necessarily async-friendly, but the exo one...
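Roughly the intended pattern, as a sketch only: the `ensure_shard_downloaded` helper and its import path below are hypothetical stand-ins for exo's actual downloader API.

```python
# Hypothetical sketch: `ensure_shard_downloaded` and its import path are
# stand-ins for exo's real downloader API, used here only to show the shape.
from exo.download import ensure_shard_downloaded  # hypothetical import


async def load_shard(shard) -> str:
    # Awaiting exo's downloader keeps the event loop responsive and lets exo
    # report progress the same way as the other inference engines, instead of
    # calling a blocking transformers/huggingface_hub download.
    local_path = await ensure_shard_downloaded(shard)
    return local_path
```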
> > It seems to use some other downloader (perhaps transformers?). It should use the exo downloader for integration with exo (also, these other downloads aren't necessarily async-friendly, but the...
It generates! Looks like some tokenizer issue. It never stops generating.
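One guess at the cause (not confirmed from the code): the generation loop never checks the tokenizer's EOS token(s), so it keeps sampling. A minimal sketch of the check, assuming a token-by-token loop; the model id and `max_new_tokens` default are placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model id
)

# eos_token_id can be a single id or a list of ids (newer Llama configs list
# several stop tokens), so normalize it to a set up front.
eos_ids = tokenizer.eos_token_id
eos_ids = set(eos_ids) if isinstance(eos_ids, (list, tuple)) else {eos_ids}


def is_finished(next_token_id: int, num_generated: int, max_new_tokens: int = 512) -> bool:
    # Stop on any EOS id, or when the generation budget is exhausted.
    return next_token_id in eos_ids or num_generated >= max_new_tokens
```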
> > > > > > It seems to use some other downloader (perhaps transformers?). It should use the exo downloader for integration with exo (also, these other downloads aren't necessarily...
Another issue (this can be fixed last, as it's a tricky one): we need to ensure that the torch operations are not blocking. This means the blocking parts need...
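A minimal sketch of the usual fix, assuming the forward pass is the blocking call (the function and argument names are placeholders): offload it to a thread executor so the asyncio event loop stays free.

```python
import asyncio

import torch


async def infer_tensor_async(model, input_ids: torch.Tensor) -> torch.Tensor:
    """Run a blocking torch forward pass without stalling the event loop."""
    loop = asyncio.get_running_loop()

    def _blocking_forward() -> torch.Tensor:
        # Heavy, synchronous work: runs on a worker thread from the default
        # thread pool executor.
        with torch.no_grad():
            return model(input_ids)

    # run_in_executor (or asyncio.to_thread on Python 3.9+) offloads the call,
    # so other coroutines (networking, discovery, heartbeats) keep running
    # while the model computes.
    return await loop.run_in_executor(None, _blocking_forward)
```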
> > > > It generates! Looks like some tokenizer issue. It never stops generating.
> >
> > Which model is this tested with? Will test more

`llama-3.1-8b`

This command: `exo...