Marcel
[nn.functional.scaled_dot_product_attention](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) is a highly efficient implementation of attention. It is significantly faster and more memory-efficient than the naive implementation and shouldn't require any new dependencies or...
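For context, a minimal sketch of what the suggested swap might look like; the tensor shapes here are illustrative assumptions, not anything from the repository:

```python
import math
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 128, 64  # assumed shapes for illustration
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Naive implementation: materializes the full (seq_len, seq_len) score matrix.
scores = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)
naive_out = torch.softmax(scores, dim=-1) @ v

# Fused implementation: dispatches to FlashAttention or a memory-efficient
# kernel when one is available, avoiding the explicit score matrix.
fused_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive_out, fused_out, atol=1e-5))
```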
The implementation of Mamba-2 mentions a file named param_grouping.py, but it cannot be found in this repository.

https://github.com/state-spaces/mamba/blob/0b5a3e04e2626f5d0caf3cb669a7f7f70ab502bb/mamba_ssm/modules/mamba2.py#L128-L130
https://github.com/state-spaces/mamba/blob/0b5a3e04e2626f5d0caf3cb669a7f7f70ab502bb/mamba_ssm/modules/mamba2_simple.py#L102-L104

There is also a `_no_weight_decay` attribute that is being set, which...
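Since param_grouping.py is missing, the following is only an assumption of what such a helper typically does with a `_no_weight_decay` attribute (the common pattern of splitting parameters into decay and no-decay optimizer groups), not the repository's actual code:

```python
import torch

def build_param_groups(model: torch.nn.Module, weight_decay: float = 0.1):
    """Hypothetical helper: split parameters into weight-decay groups."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Parameters the module has tagged with `_no_weight_decay` are
        # assumed to be excluded from weight decay.
        if getattr(param, "_no_weight_decay", False):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Usage sketch: optimizer = torch.optim.AdamW(build_param_groups(model), lr=3e-4)
```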
This pull request has been automatically generated by prose.io.