Add checks to confirm that the checkpoint conversion script works correctly
We now have a script that converts megatron-deepspeed checkpoints to HF-transformers checkpoints. The project is here and the script is here. However, the script has no unit tests confirming that the conversion is correct.
The goal of this issue is to add such tests. The idea is to run the forward pass of both models (before and after conversion) on the same random input, then use torch.allclose to assert that the output loss and the logits of the two models match.
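A minimal sketch of such a check, assuming both checkpoints can be wrapped as callables that map input_ids to a (loss, logits) pair (the wrapper names and tolerances here are illustrative, not part of the actual conversion script):

```python
import torch

def outputs_match(forward_a, forward_b, vocab_size=50257, seq_len=16, atol=1e-5):
    """Run both forward functions on the same random input and compare.

    forward_a / forward_b: callables mapping input_ids -> (loss, logits).
    Returns True iff both the losses and the logits agree element-wise
    within the given absolute tolerance.
    """
    torch.manual_seed(0)  # reproducible random input
    input_ids = torch.randint(0, vocab_size, (1, seq_len))
    with torch.no_grad():
        loss_a, logits_a = forward_a(input_ids)
        loss_b, logits_b = forward_b(input_ids)
    return (torch.allclose(loss_a, loss_b, atol=atol)
            and torch.allclose(logits_a, logits_b, atol=atol))
```

For HF models the wrapper would be something like `lambda ids: (m(ids, labels=ids).loss, m(ids, labels=ids).logits)`; the Megatron side needs an equivalent adapter.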
Here's a megatron-deepspeed checkpoint and here's the corresponding HF-transformers checkpoint. We just need to verify that these two are equivalent.
git clone https://huggingface.co/bigscience/gpt2-350m-en/tree/megatron-deepspeed is failing with "repo not found". I can download the HF version without issue.
@stas00?
The correct syntax is:
git clone --single-branch --branch megatron-deepspeed https://huggingface.co/bigscience/gpt2-350m-en
reference: https://stackoverflow.com/a/1911126/9201239
or:
git clone https://huggingface.co/bigscience/gpt2-350m-en
cd gpt2-350m-en
git checkout megatron-deepspeed
the former will download only the desired branch; the latter will download all branches, I think.
@StellaAthena Have you made progress with this issue? If not, perhaps I'll take a jab at it!
@stas00 Will the unit test run with the CI? I'm wondering if/whether the test script would have to download the Megatron checkpoints manually on each run.
I have not been able to get to this, ICLR stuff has been getting in the way. You’re welcome to take it over.
@stas00 Will the unit test run with the CI? I'm wondering if/whether the test script would have to download the Megatron checkpoints manually on each run.
The AWS-based CI is currently borked; we need to start from scratch and rebuild on GCS, so we run make test manually for now.
Re your question - there is no need to use a huge checkpoint, both because of the download size and because it'd be much more difficult to compare. It should be easy to create a tiny checkpoint of a few MBs on the fly, convert it, and then compare.
Let me know if you run into difficulties with that.
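The tiny-checkpoint-on-the-fly approach could be sketched as follows. This is a generic round-trip stand-in, not the real pipeline: `TinyLM` is a hypothetical few-KB model, and the comment marks where the actual megatron-deepspeed -> HF conversion script would run instead of the plain reload:

```python
import os
import tempfile
import torch

class TinyLM(torch.nn.Module):
    """A few-KB toy language model, standing in for a tiny test checkpoint."""
    def __init__(self, vocab=100, dim=8):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, ids):
        return self.head(self.emb(ids))

def converted_matches_original():
    """Create a tiny checkpoint on the fly, 'convert' it, and compare logits."""
    torch.manual_seed(0)
    original = TinyLM()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tiny.pt")
        torch.save(original.state_dict(), path)
        # In the real test, the conversion script would run here on `path`
        # and `restored` would be loaded from the converted HF checkpoint.
        restored = TinyLM()
        restored.load_state_dict(torch.load(path))
    ids = torch.randint(0, 100, (1, 16))
    with torch.no_grad():
        return torch.allclose(original(ids), restored(ids))
```

Because the checkpoint is created fresh on each run, the test needs no network access and can run in any CI environment.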
Reposting from slack as this seems relevant for this:
I'm looking at the transformers GPT2 code. https://huggingface.co/transformers/_modules/transformers/models/gpt2/modeling_gpt2.html#GPT2Model and it seems it is doing post-layernorm, whereas the 13B one was trained using pre-LN. Maybe this is why we're seeing poor performance in evaluation? Typically the number of params is the same; just the way we use them is different. Is there a pre-LN GPT in transformers?