Andy Lo
Just want to mention that cuBLAS (via the newer `cuBLASLt` API) does offer an interface that fuses matmul with bias addition: [`cublasLtMatmul()`](https://docs.nvidia.com/cuda/cublas/#cublasltmatmul) which computes `D = A @ B +...
It is really hard to use CUTLASS because of its large (nested) template classes, which have poor IDE support (e.g. autocompletion). [C++20 concepts](https://en.cppreference.com/w/cpp/language/constraints) are meant to be a solution...
Some of the names in the B-operand functions were copied directly from their A-operand counterparts. This fixes the variable names and comments to improve clarity.
Equations 6 and 7 in the paper suggest that the scores are computed from $\hat{x}\_{t\_i}$ (**not** $\hat{x}'\_{t\_i}$). However, in the implementation, the update (Eq. 6) is applied to `x`...
https://github.com/artidoro/qlora/blob/7f4e95a68dc076bea9b3a413d2b512eca6d004e5/qlora.py#L248-L259 I think the `names[0] if len(names) == 1 else names[-1]` expression on L254 is redundant: when `len(names) == 1`, `names[0]` and `names[-1]` are the same element, so the whole expression can be replaced with `names[-1]`.
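A quick sanity check of that claim, as a minimal sketch. It assumes `names` comes from splitting a dotted module name on `.`; the helper names and the sample module names below are made up for illustration and are not taken from `qlora.py`.

```python
def last_component_original(name: str) -> str:
    """Mirrors the conditional expression quoted above (hypothetical helper)."""
    names = name.split(".")
    return names[0] if len(names) == 1 else names[-1]


def last_component_simplified(name: str) -> str:
    """Proposed simplification: always take the last component."""
    return name.split(".")[-1]


# Illustrative module names only; str.split always returns a non-empty list,
# so indexing with [-1] is safe even for an empty string.
samples = ["lm_head", "model.layers.0.self_attn.q_proj", "a.b", ""]

for sample in samples:
    assert last_component_original(sample) == last_component_simplified(sample)
```

Since `str.split` never returns an empty list, `names[-1]` cannot raise an `IndexError` here, so the simplified form behaves the same for every input.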