Mark O'Connor

Results: 28 comments by Mark O'Connor

The Meta Llama 3 Hugging Face repo's config.json was updated from 8 to 32 a couple of weeks after release: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/commit/c4219cc9e642e492fd0219283fa3c674804bb8ed

![Image](https://github.com/user-attachments/assets/76db48ce-51e9-4672-b743-eb8d63e5a947)

It appears this was a bugfix and they did...
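Since the fix was a single scalar change in config.json, one quick way to spot what changed between two revisions is to diff the parsed JSON. A minimal sketch; the field name and values below are placeholders, not the actual contents of that commit:

```python
import json

def diff_configs(old_json: str, new_json: str) -> dict:
    """Return {key: (old, new)} for keys whose values differ between two config.json revisions."""
    old, new = json.loads(old_json), json.loads(new_json)
    return {
        k: (old.get(k), new.get(k))
        for k in old.keys() | new.keys()
        if old.get(k) != new.get(k)
    }

# Hypothetical before/after snippets -- NOT the real Llama 3 config contents.
before = '{"num_attention_heads": 32, "some_field": 8}'
after = '{"num_attention_heads": 32, "some_field": 32}'

print(diff_configs(before, after))  # {'some_field': (8, 32)}
```

The same function works on the full config.json files downloaded at two different commit hashes.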

[Notified the Meta repo](https://github.com/meta-llama/llama-models/issues/241) so they can fix this on their end too.

These are the two files showing the issue: [ops_perf_results_2025_11_24_09_41_18.csv.gz](https://github.com/user-attachments/files/23715194/ops_perf_results_2025_11_24_09_41_18.csv.gz) [ops_perf_results_2025_11_24_09_41_07.csv.gz](https://github.com/user-attachments/files/23715193/ops_perf_results_2025_11_24_09_41_07.csv.gz)

Interesting, thanks for the quick and thorough investigation, Mo 🙏 So the TL;DR is that the PagedUpdateCacheOp is filtered _on the host side_, which breaks the otherwise solid metal assumption...
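As a conceptual illustration of why host-side filtering breaks a lockstep assumption (everything here is hypothetical pseudocode, not tt-metal APIs or op names): if the host silently drops an op from some devices' programs while the mesh assumes every device runs the same sequence, the per-device programs diverge.

```python
# Hypothetical mesh of 4 devices; each entry is the op list the host dispatches.
# Names are illustrative only -- not real tt-metal ops.
mesh_program = {dev: ["matmul", "paged_update_cache", "softmax"] for dev in range(4)}

def host_side_filter(programs, skip_op, skip_devices):
    """Simulate the host dropping an op on a subset of devices."""
    return {
        dev: [op for op in ops if not (op == skip_op and dev in skip_devices)]
        for dev, ops in programs.items()
    }

filtered = host_side_filter(mesh_program, "paged_update_cache", skip_devices={2, 3})

# The lockstep assumption: every device executes the identical op sequence.
in_lockstep = len({tuple(ops) for ops in filtered.values()}) == 1
print(in_lockstep)  # False -- devices 2 and 3 now run a shorter program
```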

Seems like the only _correct_ way of using these ops is to stop passing mesh_coords and instead mask out the page table with -1 for places that shouldn't be...
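A sketch of what that masking could look like (the table shape and helper are hypothetical): instead of routing via mesh_coords, entries a given device should not touch are overwritten with a -1 sentinel.

```python
# Hypothetical page table: one row per user/batch slot, one column per block id.
page_table = [
    [0, 1, 2, 3],    # user 0: active on this device
    [4, 5, 6, 7],    # user 1: active on this device
    [8, 9, 10, 11],  # user 2: lives on another device
]

def mask_page_table(table, active_rows):
    """Replace rows not owned by this device with -1 sentinels."""
    return [
        row if i in active_rows else [-1] * len(row)
        for i, row in enumerate(table)
    ]

masked = mask_page_table(page_table, active_rows={0, 1})
print(masked[2])  # [-1, -1, -1, -1]
```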

From @mtairum:

> The way we're doing batch prefill is basically by concatenating the inputs over the sequence length.
> So for 128 seqlen and batch-32 (currently what we support...
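Concretely, the concatenation scheme in the quote can be sketched like this (a pure-Python stand-in for the actual tensor code, using the 128 seqlen x batch-32 numbers above; the token values are made up):

```python
seq_len, batch = 128, 32

# Hypothetical per-user token id lists, each of length seq_len.
inputs = [[user * 1000 + t for t in range(seq_len)] for user in range(batch)]

# Batch prefill: concatenate over the sequence dimension into one long sequence.
concatenated = [tok for user_tokens in inputs for tok in user_tokens]
assert len(concatenated) == seq_len * batch  # 4096 tokens total

# Each user's slice is recoverable by offset.
def user_slice(flat, user):
    return flat[user * seq_len:(user + 1) * seq_len]

print(user_slice(concatenated, 5) == inputs[5])  # True
```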

@jvegaTT I'm taking over our side of this while @pprajapatiTT is away for the week; please let me know as soon as you find something out.

I believe this is because broadcasting is not supported; we also see this bringing up Llama 3.2, with the following test failing for all elementwise ops:

```
@pytest.mark.parametrize(
    "dtype", [...
```