Mark O'Connor
The Meta Llama 3 Hugging Face repo's config.json was updated from 8 to 32 a couple of weeks after release: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/commit/c4219cc9e642e492fd0219283fa3c674804bb8ed  It appears this was a bugfix and they did...
[Notified the Meta repo](https://github.com/meta-llama/llama-models/issues/241) so they can fix this on their end too.
These are the two files showing the issue: [ops_perf_results_2025_11_24_09_41_18.csv.gz](https://github.com/user-attachments/files/23715194/ops_perf_results_2025_11_24_09_41_18.csv.gz) [ops_perf_results_2025_11_24_09_41_07.csv.gz](https://github.com/user-attachments/files/23715193/ops_perf_results_2025_11_24_09_41_07.csv.gz)
Interesting, thanks for the quick and thorough investigation, Mo 🙏 So the TL;DR is that the PagedUpdateCacheOp is filtered _on the host side_, which breaks the otherwise solid metal assumption...
Seems like the only _correct_ way of using these ops is to stop passing mesh_coords and instead to mask out the page table with -1 for places that shouldn't be...
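To make the masking idea concrete, here is a minimal sketch of what "mask out the page table with -1" could look like. The function name, shapes, and the `valid_users` mask are all illustrative assumptions, not the actual op's API:

```python
import numpy as np

def mask_page_table(page_table: np.ndarray, valid_users: np.ndarray) -> np.ndarray:
    """Return a copy of page_table with rows for inactive users set to -1.

    page_table:  [max_users, max_blocks_per_user] int32 block indices (hypothetical layout)
    valid_users: [max_users] bool mask of users this device should touch
    """
    masked = page_table.copy()
    masked[~valid_users, :] = -1  # -1 signals "no block mapped here"; the op skips these
    return masked

# Tiny illustration: 4 users, 2 blocks each; users 1 and 3 are inactive.
page_table = np.arange(8, dtype=np.int32).reshape(4, 2)
valid = np.array([True, False, True, False])
print(mask_page_table(page_table, valid))
```

The appeal of this over `mesh_coords` is that the invalid entries travel with the data itself, so every device runs the same program and simply skips -1 slots.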
From @mtairum:
> The way we're doing batch prefill is basically by concatenating the inputs over the sequence length.
> So for 128 seqlen and batch-32 (currently what we support...
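The layout described above can be sketched in a few lines. This is only an illustration of the concatenation, with assumed shapes, not the actual model code:

```python
import numpy as np

# Batch prefill by concatenation: a batch of 32 users, each with a
# 128-token prompt, is flattened into one [1, 32 * 128] sequence.
batch, seqlen = 32, 128

# Each user's token ids as a [1, seqlen] row (filled with the user index
# here just so the layout is visible).
per_user = [np.full((1, seqlen), i, dtype=np.int64) for i in range(batch)]

concatenated = np.concatenate(per_user, axis=1)  # shape [1, batch * seqlen]
print(concatenated.shape)  # (1, 4096)
```

User `i`'s tokens then live at positions `[i * seqlen, (i + 1) * seqlen)` of the concatenated sequence.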
@jvegaTT I'm taking over our side of this while @pprajapatiTT is away for the week, please let me know as soon as you find something out.
I believe this is because broadcasting is not supported; we also see this while bringing up Llama 3.2, with the following test failing for all elementwise ops:
```
@pytest.mark.parametrize(
    "dtype",
    [...
```
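For context, here is a minimal NumPy illustration of the kind of broadcasting an elementwise test like this exercises: combining a full tensor with a size-1 operand. The shapes are representative stand-ins, not the ones from the actual test:

```python
import numpy as np

# Elementwise add where the second operand is broadcast: b has size 1 in
# every dimension except the last, so it is repeated across a's other dims.
a = np.ones((2, 1, 32, 32), dtype=np.float32)
b = np.full((1, 1, 1, 32), 2.0, dtype=np.float32)

out = a + b  # NumPy broadcasts b across all 32 rows of each 32x32 tile
print(out.shape)  # (2, 1, 32, 32)
```

A backend without broadcasting support would have to reject (or explicitly tile) the `b` operand here, which matches the failure pattern above.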