hecmay
> > I suppose that should only work for the one-read-one-write case. We can support this specific one-read-one-write case without generating any local buffer; the local buffer is generated for other...
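A minimal sketch of that distinction, assuming a buffer is described only by the order in which the producer writes its elements and the order in which the consumer reads them. The helper `can_stream` is hypothetical, not a HeteroCL API. When every element is written exactly once and read exactly once, in the same order, a FIFO stream can stand in for the local buffer; any other pattern needs storage.

```python
def can_stream(write_order, read_order):
    """Return True if a FIFO can replace the local buffer.

    Hypothetical helper: a FIFO only works for the one-read-one-write
    case where the consumer reads elements in the order they were written.
    """
    write_order = list(write_order)
    # Some element is written more than once: not one-write.
    if len(set(write_order)) != len(write_order):
        return False
    # Reads must match writes one-to-one and in the same order.
    # Since the writes are unique, this also rules out repeated reads.
    return list(read_order) == write_order


print(can_stream([0, 1, 2, 3], [0, 1, 2, 3]))              # in-order, read once
print(can_stream([0, 1, 2, 3], [3, 2, 1, 0]))              # reversed read order
print(can_stream([0, 1, 2, 3], [0, 0, 1, 1, 2, 2, 3, 3]))  # each element read twice
```

Only the first pattern can be streamed. The reversed order and the repeated reads both require the data to be held somewhere, which is where the local buffer comes in.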
> > > > Agreed. The abstraction is not clear enough, and it's too much work for users to call `.to` for each of the arguments. > > > >...
@chhzh123 I just had a discussion with Sean. I will create a `ZeroCopy` mode for the `.to` primitive, so that you will be able to generate a kernel function without any read/write nested...
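As a rough software analogy only (this is not HeteroCL code, and both function names are hypothetical), here is the difference between a copying interface and a zero-copy one. In the copy version, the kernel body is wrapped in loops that read each argument into a local buffer and then write the result back. In the zero-copy version, the kernel works directly on the caller's buffer.

```python
def kernel_copy(arg):
    """Copying interface: explicit read and write loops around the compute."""
    # Read loop: copy every element into a local buffer.
    local = [0] * len(arg)
    for i in range(len(arg)):
        local[i] = arg[i]
    # Compute on the local buffer.
    for i in range(len(local)):
        local[i] += 1
    # Write loop: copy every element back to the caller's buffer.
    for i in range(len(arg)):
        arg[i] = local[i]


def kernel_zero_copy(arg):
    """Zero-copy interface: compute directly on the caller's buffer."""
    for i in range(len(arg)):
        arg[i] += 1


a, b = [1, 2, 3], [1, 2, 3]
kernel_copy(a)
kernel_zero_copy(b)
print(a, b)
```

Both versions produce the same result. The zero-copy one simply drops the read and write loops, which is the code the proposed mode would avoid generating.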
After I add the aforementioned features, I will create a simple primitive for compute placement. This will make our lives easier: we will not have to call so many `.to`...
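One possible shape for such a primitive, sketched with a stand-in schedule object. `MockSchedule`, `place`, and the device strings are all hypothetical, and the real `.to` signature may differ. The helper issues one `.to` call per argument, so the user writes a single call for the whole kernel.

```python
class MockSchedule:
    """Stand-in for a schedule; it only records data-movement requests."""

    def __init__(self):
        self.moves = []

    def to(self, tensor, device):
        self.moves.append((tensor, device))


def place(schedule, inputs, outputs, device, host="host"):
    """Hypothetical compute-placement helper.

    Moves every input to `device` and every output back to `host`,
    replacing one hand-written `.to` call per argument.
    """
    for t in inputs:
        schedule.to(t, device)
    for t in outputs:
        schedule.to(t, host)


s = MockSchedule()
place(s, inputs=["A", "B"], outputs=["C"], device="fpga")
print(s.moves)
```

Keeping the helper as a thin loop over the existing `.to` means the lower-level primitive is still there whenever per-argument control is needed.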
The basic partitioning is working, and we have around 5 test cases for it. Yeah, we can add more test cases first to ensure that it is robust enough.
Hi jessewjx,

1. The version you are using does not really work on large matrices. For a 1024x1024 GEMM, it tries to implement a systolic array with 1024x1024 PEs, which is...
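To put that size in perspective, one PE per output element means N x N PEs for an N x N GEMM, which for N = 1024 is 1,048,576 PEs. A minimal sketch of the usual alternative follows. It assumes a fixed tile x tile array that computes the output one tile per pass. The tile size of 16 is an arbitrary illustrative choice, and the function names are hypothetical.

```python
import math


def pes_full(n):
    """PEs needed when every output element of an n x n GEMM gets its own PE."""
    return n * n


def tiled_plan(n, tile):
    """(PE count, number of passes) for a fixed tile x tile array over an n x n output."""
    tiles_per_dim = math.ceil(n / tile)
    return tile * tile, tiles_per_dim * tiles_per_dim


print(pes_full(1024))        # PEs for the full-size array
print(tiled_plan(1024, 16))  # (PEs, passes) for a 16x16 array
```

With the tiled plan, the PE count stays fixed at 256 however large the matrix gets. Size now shows up as 4096 passes instead, so the tile size becomes a knob for trading area against latency.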
Yeah. You can parameterize the algorithm as matrix A (4096x4096) multiplied by B (4096x1).
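Here is a plain-Python reference for that parameterization, which could be used to check a hardware kernel's output (the name `gemv` is illustrative). The shapes come from the arguments, so the same function covers A (4096x4096) times B (4096x1) as well as the small sizes that are convenient for quick tests.

```python
def gemv(A, B):
    """Multiply an M x K matrix A by a K x 1 column vector B.

    A is a list of M rows, each a list of K numbers. B is a list of K
    single-element rows, matching the (K x 1) shape. Returns an M x 1 result.
    """
    m, k = len(A), len(B)
    if any(len(row) != k for row in A):
        raise ValueError("A must have as many columns as B has rows")
    return [[sum(A[i][j] * B[j][0] for j in range(k))] for i in range(m)]


A = [[1, 2],
     [3, 4]]
B = [[5],
     [6]]
print(gemv(A, B))  # [[17], [39]]
```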
Hi @jessewjx, please see the systolic GEMM here: https://github.com/Hecmay/heterocl/blob/fix/samples/systolic_array/systolic_array_module_stream.py It may take a long time for this PR to be merged, so you can simply pull and compile the `fix`...
Bumping this up since we have been running into this issue recently. The build and simulation process can take a surprisingly long time as the number of compute units (CUs) in the...
Here is an initial attempt at the `.to()` revamp: https://github.com/Hecmay/heterocl/tree/fix