torchchat
Run PyTorch LLMs locally on servers, desktop and mobile
This PR adds max-autotune support for CPU in torch.compile. It also splits first-token and next-token timings in the log output; a sketch of both pieces follows below.
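For context, a minimal sketch of max-autotune compilation with a split first-/next-token timing log. `TinyLM` and the decode loop are illustrative stand-ins, not torchchat's actual generate path; only `torch.compile(..., mode="max-autotune")` is the real API in question.

```python
import time
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Stand-in model; torchchat compiles its real transformer the same way."""
    def __init__(self, vocab=128, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, vocab)

    def forward(self, tokens):
        return self.proj(self.emb(tokens)).argmax(-1)

model = TinyLM().eval()
# mode="max-autotune" asks Inductor to spend extra compile time
# benchmarking kernel variants; on CPU this picks tuned C++ GEMM paths.
compiled = torch.compile(model, mode="max-autotune")

tokens = torch.randint(0, 128, (1, 16))
with torch.no_grad():
    t0 = time.perf_counter()
    tokens = compiled(tokens)          # prefill -> first token
    first = time.perf_counter() - t0

    t1 = time.perf_counter()
    for _ in range(31):                # steady-state decode steps
        tokens = compiled(tokens)
    rest = (time.perf_counter() - t1) / 31

# Report the two phases separately, as the PR's log split does.
print(f"first token: {first:.3f}s, next tokens: {rest:.4f}s/token")
```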
This PR aims to support the Flamingo component, including the model component, input preprocessing, pipeline updates, etc.
### 🐛 Describe the bug Eval is very slow for PTE models compared with non-exported models; the opposite should be true, as can be observed in generate. I suspect this...
### 🚀 The feature, motivation and pitch First surfaced in https://github.com/pytorch/torchchat/pull/1057, the `replace_attention_with_custom_sdpa_attention` function, used when exporting models in torchchat, can be replaced with the equivalent API provided in the...
### 🐛 Describe the bug Instructions for running the API are collapsed by default, and the instructions for the browser don't clearly call out that the API needs to be...
Minor QoL change; push the formatting of text string prompts into the helper.

---

## Testing

Tested via browser:

```
python torchchat.py server llama3.2-11b
streamlit run torchchat/usages/browser.py
```

Tested in...
### 🚀 The feature, motivation and pitch Can torchchat pick up models that have already been downloaded by Ollama? Is there a way to use them without downloading them...
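Torchchat does not read Ollama's store today, but Ollama keeps its weights as GGUF blobs, so in principle torchchat's GGUF loader could be pointed at them. A rough sketch, assuming a default Ollama install on Linux/macOS; the manifest layout, the media type string, and feeding the blob to `--gguf-path` are assumptions to verify against your versions.

```python
import json
from pathlib import Path

# Ollama's default model store: blobs are GGUF files named by their
# sha256 digest, and manifests map model tags to blob digests.
store = Path.home() / ".ollama" / "models"
manifest = (store / "manifests" / "registry.ollama.ai"
            / "library" / "llama3.2" / "latest")

layers = json.loads(manifest.read_text())["layers"]
# The weights layer carries the Ollama model media type (assumed here).
digest = next(
    layer["digest"] for layer in layers
    if layer["mediaType"] == "application/vnd.ollama.image.model"
)
blob = store / "blobs" / digest.replace(":", "-")

# Hypothetically reusable via torchchat's GGUF path, e.g.:
#   python torchchat.py generate --gguf-path <blob> ...
print(f"GGUF blob for llama3.2: {blob}")
```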
### 🐛 Describe the bug https://github.com/pytorch/torchchat/blob/main/install/requirements.txt#L15 https://github.com/pytorch/pytorch/blame/main/requirements.txt#L5 This pinning complicates (I would say "prohibits", but there is probably a way) running torchchat with a locally built PyTorch. ### Versions internal devserver, python 3.12
Implement the AO API in torchchat quantization handlers and unify the logic (see the sketch after this list):

1. Implement `.quantize()` for TC quantization handlers, with args kept consistent with AO.
2. Remove...
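A sketch of what a unified handler might look like, assuming torchao's `quantize_` entry point; the `Int8WeightOnlyHandler` class below is illustrative, not torchchat's actual handler name.

```python
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

class Int8WeightOnlyHandler:
    """Hypothetical torchchat handler exposing an AO-style .quantize()."""

    def quantize(self, model: nn.Module) -> nn.Module:
        # Delegate to torchao rather than a bespoke torchchat path;
        # quantize_ swaps eligible nn.Linear weights in place.
        quantize_(model, int8_weight_only())
        return model

# Usage: model = Int8WeightOnlyHandler().quantize(model)
```

Delegating to `quantize_` keeps torchchat's handlers as thin adapters, so argument names and quantization behavior stay consistent with AO instead of drifting in a parallel implementation.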