Jiqun Tu

Results 8 issues of Jiqun Tu

~~This PR makes the DWF fused kernels run with NVSHMEM. Running on one node on Selene with 1x2x2x2 and Ls = 12, getting (performance numbers are in GFLOPS)~~ No there...

As the title suggests, we should have a unified interface for creating/accepting preconditioning solvers. Currently, with #1061, the create part of the interface is located in `invert_preconditioner.h`.

feature
clean-up

Improvements to split grid in the future: - Add support for split grid + multi-shift. It should be straightforward. - Add support for split grid when the number of...

feature
clean-up

Add an `instantiate` item for `copy_gauge_field` and `copy_gauge_field_offset` for the gauge orders, etc. One tricky thing is that with the lists in `instantiate.h` it becomes hard to know which file...

clean-up

Currently `trove` uses a 1-d thread index, i.e. it uses `threadIdx.x` instead of `(threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x`. We should make sure a 3-d thread index is used...

clean-up
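The index arithmetic quoted in the `trove` issue above can be checked on the host. This is only a minimal sketch: `Dim3` and `linear_thread_index` are hypothetical names standing in for CUDA's built-in `threadIdx`/`blockDim`, and none of it is taken from `trove` or QUDA.

```cpp
// Hypothetical stand-in for CUDA's dim3/uint3 so the arithmetic can be
// checked on the host without a GPU.
struct Dim3 {
  unsigned x, y, z;
};

// Flatten a 3-d thread index into a linear one: x runs fastest, then y, then z.
// This is the expression quoted in the issue text.
constexpr unsigned linear_thread_index(Dim3 thread_idx, Dim3 block_dim)
{
  return (thread_idx.z * block_dim.y + thread_idx.y) * block_dim.x + thread_idx.x;
}

// For a 1-d block the expression reduces to threadIdx.x, which is why code
// that uses threadIdx.x alone is only correct for 1-d blocks.
static_assert(linear_thread_index(Dim3{5, 0, 0}, Dim3{32, 1, 1}) == 5, "1-d block");

// In a 32x2x2 block, thread (3, 1, 1) has linear index (1*2 + 1)*32 + 3 = 99,
// whereas threadIdx.x alone would report 3.
static_assert(linear_thread_index(Dim3{3, 1, 1}, Dim3{32, 2, 2}) == 99, "3-d block");
```

In a multi-dimensional block, several threads share the same `threadIdx.x`, so any per-thread bookkeeping keyed on `threadIdx.x` alone would collide; the flattened index is unique within the block.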

The `copy__buffer` methods of the various field types assume the buffer is on the host - this prevents one from splitting and joining fields from device buffers when GPU...

optimization

One could cache the collected gauge and clover fields and reuse the previously generated fields when doing split grid. The starting point should probably be - https://github.com/lattice/quda/blob/ed21580eabd7dd8bfebee40a65ab813af1453f95/lib/interface_quda.cpp#L3180 - https://github.com/lattice/quda/blob/ed21580eabd7dd8bfebee40a65ab813af1453f95/lib/interface_quda.cpp#L3202 In...

optimization

MMA-izing the prolongator and restrictor kernels.