Mark O'Connor issues

Results: 10 issues of Mark O'Connor

Txt format Google Drive file is truncated and incomplete

At least one of the files in the 68MB file at https://drive.google.com/file/d/1WYfgr31T-PPwMcxuAq09XZfHQO5Mw8fE/view?usp=sharing is truncated in the middle of the list of intervals, with several lines missing and no closing "

Tensor deallocate unusably slow (0.5ms per call)

Mistral attention (branch: mistral-fast-attention d3b355a6a6e19827517deb5fdb3a91c03f079ea6) has very slow tensor deallocate calls. To reproduce on the above checkout: `pytest models/demos/mistral7b/tests/test_mistral_attention.py` ``` ... Test | INFO | Small tensor algorithm selected 22...

bug

models

mistral

P1_critical

LLMs on Metal

perf

Replace FreeList shared_ptr with local_shared_ptr

Using a shared_ptr in FreeList is unreasonably slow because it uses atomic operations for all the reference counts, and these get hit every time the list is iterated....

use_program_cache fixture silently fails in combination with t3k_device_mesh

The `use_program_cache` pytest fixture only enables the program cache if combined with `device` or `all_devices` but silently does nothing when used with `t3k_device_mesh`. This was... interesting to track down.

metal

models

mixtral

LLMs on Metal
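
One way to avoid the silent no-op described in the `use_program_cache` issue above is to make the helper fail loudly when it finds no device to act on. The sketch below is only an illustration of that pattern: `FakeDevice`, its `enable_program_cache` method, and the dictionary-based lookup are hypothetical stand-ins, not the real fixture code or the ttnn API.

```python
class FakeDevice:
    """Hypothetical stand-in for a device handle (not a real ttnn class)."""

    def __init__(self):
        self.program_cache_enabled = False

    def enable_program_cache(self):
        self.program_cache_enabled = True


# Fixture names this sketch knows how to handle (assumed, taken from the issue text).
DEVICE_FIXTURES = ("device", "all_devices", "t3k_device_mesh")


def use_program_cache(fixture_values):
    """Enable the cache on every device found in fixture_values.

    fixture_values maps a fixture name to either a single device or a
    list/tuple of devices. Raises instead of silently doing nothing.
    """
    devices = []
    for name in DEVICE_FIXTURES:
        value = fixture_values.get(name)
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            devices.extend(value)
        else:
            devices.append(value)
    if not devices:
        raise RuntimeError(
            "use_program_cache found no device fixture; expected one of "
            + ", ".join(DEVICE_FIXTURES)
        )
    for dev in devices:
        dev.enable_program_cache()
    return devices


# A mesh is modelled here as a plain list of devices (an assumption).
mesh = [FakeDevice(), FakeDevice()]
enabled = use_program_cache({"t3k_device_mesh": mesh})
```

With this shape, a test that requests the cache without a recognised device fixture raises immediately instead of running uncached.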

Reshape -> Transpose gives bad PCC

**Describe the bug** ``` tt_input = tt_input.reshape(1, 2048, 4, 128) tt_output= ttnn.transpose(tt_input, 1, 2) ``` gives 0.0 PCC compared to Torch: ``` torch_ref = torch_input.view(1, 2048, 4, 128) torch_ref =...

bug

LLM_bug

Op Generalization

llama3
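
To make the PCC comparison in the reshape/transpose issue above concrete, here is a minimal host-side sketch in plain Python. It assumes PCC means the Pearson correlation coefficient between flattened outputs, uses a scaled-down shape in place of `(1, 2048, 4, 128)`, and does not call ttnn or torch. The helper names `pcc` and `transpose_1_2` are made up for illustration.

```python
import math
import random


def pcc(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def transpose_1_2(flat, shape):
    """Swap dims 1 and 2 of a row-major 4-D buffer; returns a new flat list."""
    d0, d1, d2, d3 = shape
    out = []
    for i in range(d0):
        for k in range(d2):
            for j in range(d1):
                base = ((i * d1 + j) * d2 + k) * d3
                out.extend(flat[base:base + d3])
    return out


random.seed(0)
shape = (1, 16, 4, 8)  # small stand-in for (1, 2048, 4, 128)
# A row-major reshape does not move data, so the flat buffer is the reshaped tensor.
data = [random.gauss(0.0, 1.0) for _ in range(math.prod(shape))]

reference = transpose_1_2(data, shape)
good_pcc = pcc(reference, transpose_1_2(data, shape))  # matching layout
bad_pcc = pcc(reference, data)  # transpose skipped: elements in the wrong order
```

A correct transpose gives a PCC of essentially 1, while an output whose elements are in the wrong order gives a PCC near 0 on random data, which is the kind of symptom the issue reports.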

Transpose fails for unaligned shapes

**Describe the bug** WH transpose fails if W is an unaligned value such as 5. **To Reproduce** Can be trivially reproduced by adding `[[1, 1024, 5, 1280]], # Non page-aligned`...

bug

P2_should_have

LLM_bug

Op Generalization

llama3

Large DRAM-sharded matmuls reliably hang on wormhole

**Describe the bug** Large DRAM-sharded matmuls cause wormhole to hang after a few iterations. Reducing the number of columns in the weight matrix to below 32k seems to work around...

bug

models

LLM_bug

llama3

tracy profiler
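
The matmul issue above notes that keeping the weight matrix below 32k columns seems to avoid the hang. One possible way to stay under such a limit without changing the result, sketched here in plain Python on tiny matrices, is to split the weight matrix into column blocks and concatenate the partial outputs. This is an assumption about how the workaround could be applied, not something the issue describes, and `column_chunked_matmul` is a hypothetical helper.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def column_chunked_matmul(a, w, max_cols):
    """Compute a @ w by splitting w into column blocks of at most max_cols."""
    n_cols = len(w[0])
    out = [[] for _ in a]
    for start in range(0, n_cols, max_cols):
        # Slice the same column range out of every row of w.
        block = [row[start:start + max_cols] for row in w]
        partial = matmul(a, block)
        # Append this block's output columns to the right of each output row.
        for out_row, part_row in zip(out, partial):
            out_row.extend(part_row)
    return out


a = [[1, 2, 3], [4, 5, 6]]
w = [[c + 10 * r for c in range(7)] for r in range(3)]

full = matmul(a, w)
chunked = column_chunked_matmul(a, w, max_cols=3)
```

Because each output column depends on only one column of the weight matrix, the chunked result is identical to the single large product.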

Mark O'Connor

Fix Llama rope scaling factor, improve accuracy

Qwen3 dense model support

Tracy misses devices when logging the prefill variant of PagedUpdateCache