Mark O'Connor

Results 28 comments of Mark O'Connor

@arakhmati The paths are correct for CI but not for dev clusters on aus and yyz - these are found in /proj_sw/ and you can update the default path with...

I appear to have the same issue with Mixtral; looking at a Tracy dump it's a mess of `Tensor::deallocate` calls:

That's over a minute (not a second!) in deallocate calls:

There are over 3M of them made in this run which had a handful of inferences over a single layer:

Using a `shared_ptr` for FreeList isn't great:
- Iterating through it (for those lovely long linear-time searches) copies the object N times, causing 2N _atomic_ reference count updates.
- Atomic...

Not sure why GitHub won't allow me to attach a python file, here is `dram_sharded_hang.py`:

```python
import math
import torch
import ttnn
import pytest
from tqdm import tqdm

@pytest.mark.timeout(300)
def...
```

For context: this is Llama 3's LM head matmul. We have to split it into two because otherwise there's not enough L1 on the 12 DRAM-sharded cores to handle columns...

Reducing the number of columns in the weight matrix to below 32k seems to work around the issue, so we break it up into N smaller matmuls and execute them...

Why was this `(NOC_MAX_TRANSACTION_ID_COUNT+1)/2` in the first place? Why not wait until less than the max? Performance reasons?

@yugaoTT Our models contain a workaround for this issue that you can disable to test. Edit `models/demos/llama3/tt/lm_head.py` and change `max_columns_per_device=128256# // 4, # larger values per device lead to OOM...