GPT2 doesn't generate new tokens if pad_token is added
System Info
- transformers version: 4.38.2
- Platform: macOS-14.3.1-arm64-arm-64bit
- Python version: 3.11.6
- Huggingface_hub version: 0.20.3
- Safetensors version: 0.4.2
- Accelerate version: 0.28.0
- Accelerate config: - compute_environment: LOCAL_MACHINE
- distributed_type: NO
- mixed_precision: no
- use_cpu: False
- debug: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- dynamo_config: {'dynamo_backend': 'INDUCTOR'}
- PyTorch version (GPU?): 2.2.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
@ArthurZucker (text model) and @gante (generation)
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
tokenizer.add_special_tokens({'pad_token': '<|PAD|>'})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id
test = "Hello"
input = tokenizer(test, return_tensors="pt")
outputs = model.generate(input.input_ids, pad_token_id=tokenizer.pad_token_id, attention_mask=input.attention_mask, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False)) # => Hello<|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|>
If, instead, I set the pad_token to the eos_token, it generates properly.
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
test = "Hello"
input = tokenizer(test, return_tensors="pt")
outputs = model.generate(input.input_ids, attention_mask=input.attention_mask, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False)) # Hello The first time I saw a new game,
Expected behavior
I would expect adding a padding token to not affect the model's ability to generate. It doesn't seem to affect other decoder-only models.
Hey! That's somewhat expected. Resizing the embedding changes the distribution:
In [25]: model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
In [26]: model.pad_token_id = tokenizer.pad_token_id
In [27]: outputs = model.generate(input.input_ids, pad_token_id=tokenizer.eos_token_id, attention_mask=input.attention_mask, max_length=10)
In [28]: print(tokenizer.decode(outputs[0], skip_special_tokens=False))
Hello The first time I saw a new game,
In [29]: model.pad_token_id
Out[29]: 50257
In [30]: model.resize_token_embeddings(len(tokenizer))
Out[30]: Embedding(50258, 768)
In [31]: outputs = model.generate(input.input_ids, pad_token_id=tokenizer.eos_token_id, attention_mask=input.attention_mask, max_length=10)
In [32]: print(tokenizer.decode(outputs[0], skip_special_tokens=False))
Hello<|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|>
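A minimal diagnostic sketch, assuming the same distilgpt2 checkpoint and default greedy decoding (exact ranks may vary with the random initialization): inspect the next-token logits after resizing to see that the freshly added, randomly initialized pad row tends to take the top logit, which is why generation keeps emitting it.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
tokenizer.add_special_tokens({'pad_token': '<|PAD|>'})
model.resize_token_embeddings(len(tokenizer))  # new row is randomly initialized

inputs = tokenizer("Hello", return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the token after "Hello"
pad_id = tokenizer.pad_token_id
print(next_token_logits.argmax().item() == pad_id)  # often True: greedy decoding then emits <|PAD|>
print(next_token_logits.topk(5).indices)            # the new pad id typically ranks first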
Thank you for the reply.
It seems like something is missing. Even though resizing changes the distribution, why would the new token end up at the top? This doesn't happen with some other causal models (e.g. phi-1.5). What would be the recommended course of action for differentiating padding from eos? This could likely be documented, as it is a relatively common need.
GPT2 is a fairly old model. You can check that resizing to another size will also make it generate the new tokens. The recommended way is to initialize the new embeddings as described in https://nlp.stanford.edu/~johnhew/vocab-expansion.html, which also explains the failure.
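A minimal sketch of the initialization trick from the linked post, assuming distilgpt2's tied input/output embeddings: set the newly added row(s) to the mean of the pre-trained rows so the pad token no longer dominates the output distribution.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

num_old = model.get_input_embeddings().weight.shape[0]
tokenizer.add_special_tokens({'pad_token': '<|PAD|>'})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

with torch.no_grad():
    emb = model.get_input_embeddings().weight
    # initialize the new row(s) with the mean of the pre-trained rows;
    # distilgpt2 ties input and output embeddings, so the lm_head row is updated too
    emb[num_old:] = emb[:num_old].mean(dim=0, keepdim=True)

inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask,
                         pad_token_id=tokenizer.pad_token_id, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))  # should no longer be all <|PAD|>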
Thanks! So wouldn't this indicate that this is an improvement that should be integrated into the transformers library, or that a warning should be raised when embeddings are resized?
Yeah, I think we could help users with this. Either add this to the tokenizer when resizing, or to resize_token_embeddings. TBH I'd rather we properly document this once and for all in the resize_token_embeddings documentation rather than add a warning! Do you want to open a PR for that?
What are you suggesting: adding documentation to resize_token_embeddings in PreTrainedModel?
Yes ! 🤗
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Will look at it