Piotr Nawrot
We have released [nanoT5](https://github.com/PiotrNawrot/nanoT5) for pre-training and evaluating T5-style (Encoder-Decoder) models. You can use it to pre-train your own model in one day on a single GPU :).
We've released [nanoT5](https://github.com/PiotrNawrot/nanoT5), which reproduces T5 (an encoder-decoder model, similar to BART) pre-training in PyTorch (not Flax). You can take a look! Any suggestions are more than welcome.
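For anyone curious what pre-training a T5-style encoder-decoder boils down to, here is a minimal, illustrative sketch of a single training step in plain PyTorch with Hugging Face Transformers. This is not nanoT5's actual training loop; the checkpoint, toy batch, and learning rate below are assumptions for illustration, so please see the repo for the real setup.

```python
# Minimal, illustrative sketch of one pre-training step for a T5-style
# encoder-decoder model in PyTorch; NOT nanoT5's actual training loop.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed hyperparameters

# Toy span-corruption-style batch: sentinel tokens (<extra_id_n>) mark masked spans.
inputs = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt")
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>",
                   return_tensors="pt").input_ids

outputs = model(**inputs, labels=labels)  # encoder-decoder forward + cross-entropy loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```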
Another relevant paper on hierarchical processing in Transformer decoder models is [this one](https://arxiv.org/pdf/2211.09761.pdf).
+1, I'm getting exactly the same results
Hey @Kyriection - Thanks a lot for your response and extra clarification. I'm having one more issue with reproducing Figure 8 from the latest version of the paper. I followed...
Moreover, I'm also having issues with reproducing the Table 2 results from the paper for OPT-30B. Again, I believe that I'm strictly following the commands from the README. It would be...
> "and for practical use, you can use the accumulation attention scores obtained from the whole prefilling stage" Did you use scores from prefilling stage for any of the downstream...
Yes, I understand - is this logic implemented somewhere in the code? Also, do you have any idea what could be the reason behind my suboptimal results?
This is still an issue for me as well!