mzchtx
Inspired by the MixMatch paper, mixup can also be applied to the supervised (labeled) data. In this way, we can achieve improved performance, even better than the native UDA.
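A minimal sketch of what mixup on a labeled batch could look like, written with NumPy. The function name `mixup_batch`, the `alpha=0.75` default, and the choice to mix each example with a random partner from the same batch are assumptions made for illustration, not the setup actually used with UDA here. The `max(lam, 1 - lam)` step follows the MixMatch variant of mixup, which keeps each mixed sample closer to its original.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.75, rng=None):
    """Mix each labeled example with a random partner from the same batch.

    x:        (N, ...) float array of inputs
    y_onehot: (N, C) float array of one-hot (or soft) labels
    alpha:    Beta distribution parameter (hypothetical default)
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    # MixMatch-style tweak: keep the mixed sample closer to the original one.
    lam = max(lam, 1.0 - lam)
    perm = rng.permutation(len(x))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam
```

The supervised loss would then be computed as cross-entropy between the model output on `x_mix` and the soft targets `y_mix`, while the unsupervised consistency term is left unchanged.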
We found that the way of calculating coordinate mapping in CV-CUDA's resize is different from that of OpenCV (as shown in the pseudo-code in the figure below): - [OpenCV uses...
I think we can slice k, v and mask before calling `F.scaled_dot_product_attention()` to reduce the calculation; otherwise attention is always computed over all max_seq_len positions, even when input_pos is relatively small...
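A rough sketch of the slicing idea, assuming a cache layout of `(B, H, max_seq_len, D)`, a boolean mask of shape `(T, max_seq_len)` where `True` means "may attend", and an `input_pos` that is sorted in increasing order. The helper name `attend_with_sliced_cache` is hypothetical and not part of the model code.

```python
import torch
import torch.nn.functional as F

def attend_with_sliced_cache(q, k_cache, v_cache, mask, input_pos):
    """Attend only over the cache positions that can be filled so far.

    q:                (B, H, T, D)
    k_cache, v_cache: (B, H, max_seq_len, D)
    mask:             (T, max_seq_len) bool, True = may attend
    input_pos:        (T,) positions of the current tokens, increasing
    """
    # Positions beyond the last written index are masked out anyway,
    # so dropping them should not change the result.
    end = int(input_pos[-1]) + 1
    k = k_cache[:, :, :end]
    v = v_cache[:, :, :end]
    m = mask[..., :end]
    return F.scaled_dot_product_attention(q, k, v, attn_mask=m)
```

With this, the cost of the attention call scales with `input_pos[-1] + 1` rather than with `max_seq_len`. One trade-off to keep in mind: the sliced length changes at every step, which may matter if the model is compiled with static shapes.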
I think we can use `index_copy_`, the in-place version of `index_copy`, to avoid the extra cache creation and copying: https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/model.py#L217-L218
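To illustrate the difference, here is a small sketch assuming the cache has shape `(B, H, max_seq_len, D)` and that the sequence dimension is 2. The helper name `update_cache_` is hypothetical. `Tensor.index_copy_` writes into the existing storage, whereas `Tensor.index_copy` returns a new tensor.

```python
import torch

def update_cache_(cache, input_pos, new):
    """Write `new` into `cache` at `input_pos` along dim 2, in place.

    cache:     (B, H, max_seq_len, D)
    input_pos: (T,) long tensor of positions to overwrite
    new:       (B, H, T, D) keys or values for the current tokens
    """
    # In-place: no second cache-sized tensor is allocated.
    cache.index_copy_(2, input_pos, new)
    return cache

cache = torch.zeros(1, 2, 8, 4)
new = torch.ones(1, 2, 3, 4)
pos = torch.tensor([2, 3, 4])

copied = cache.index_copy(2, pos, new)   # out-of-place: allocates a new tensor
updated = update_cache_(cache, pos, new) # in-place: reuses the cache storage
```

Since the in-place version mutates the cache, it is only safe when nothing else (for example autograd during training) still needs the old contents; during inference under `torch.no_grad()` that is usually not a concern.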