Prashant Tandon
I've encountered the same error. I tried it on Linux as well as Windows.
@hkproj I think the implementation deviates from the architecture proposed in the paper. The paper states that normalization is applied after each sublayer, i.e. there is the output of the...
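To make the distinction concrete, here is a minimal sketch of the two orderings, assuming the comment is contrasting "normalize after the sublayer" with the common "normalize before the sublayer" variant. Whether the implementation under discussion actually uses the second form is an assumption, and `layer_norm`, `post_norm`, `pre_norm`, and `toy_sublayer` are hypothetical names for illustration only, not taken from the repository.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm(x, sublayer):
    """Normalization after the sublayer: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    """Normalization before the sublayer: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or feed-forward: doubles each element.
    return [2.0 * v for v in x]

x = [1.0, 2.0, 4.0]
post_out = post_norm(x, toy_sublayer)  # output itself is normalized
pre_out = pre_norm(x, toy_sublayer)    # residual path is left unnormalized
```

In the first form the block's output is always normalized, while in the second the residual stream carries the raw input through, so the two are not interchangeable even with identical sublayers.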
@dhantule @laxmareddyp I'd like to work on adding Magistral.
> [@dhantule](https://github.com/dhantule) [@laxmareddyp](https://github.com/laxmareddyp) I'd like to work on adding Magistral. Please confirm if these are the appropriate references - Paper link - [Magistral](https://arxiv.org/pdf/2506.10910) - HF link for the model - ...