GPT2 doesn't generate new tokens if pad_token is added
System Info
- transformers version: 4.38.2
- Platform: macOS-14.3.1-arm64-arm-64bit
- Python version: 3.11.6
- Huggingface_hub version: 0.20.3
- Safetensors version: 0.4.2
- Accelerate version: 0.28.0
- Accelerate config: - compute_environment: LOCAL_MACHINE
- distributed_type: NO
- mixed_precision: no
- use_cpu: False
- debug: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- dynamo_config: {'dynamo_backend': 'INDUCTOR'}
- PyTorch version (GPU?): 2.2.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
@ArthurZucker (text model) and @gante (generation)
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
tokenizer.add_special_tokens({'pad_token': '<|PAD|>'})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id
test = "Hello"
input = tokenizer(test, return_tensors="pt")
outputs = model.generate(input.input_ids, pad_token_id=tokenizer.pad_token_id, attention_mask=input.attention_mask, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False)) # => Hello<|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|>
If, instead, I set the pad_token to the eos_token, it generates properly.
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
test = "Hello"
input = tokenizer(test, return_tensors="pt")
outputs = model.generate(input.input_ids, attention_mask=input.attention_mask, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False)) # Hello The first time I saw a new game,
Expected behavior
I would expect adding a padding token to not affect the model's ability to generate. It doesn't seem to affect other decoder-only models.
Hey! That's somewhat expected. Resizing the embedding changes the distribution:
In [25]: model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
In [26]: model.pad_token_id = tokenizer.pad_token_id
In [27]: outputs = model.generate(input.input_ids, pad_token_id=tokenizer.eos_token_id, attention_mask=input.attention_mask, max_length=10)
In [28]: print(tokenizer.decode(outputs[0], skip_special_tokens=False))
Hello The first time I saw a new game,
In [29]: model.pad_token_id
Out[29]: 50257
In [30]: model.resize_token_embeddings(len(tokenizer))
Out[30]: Embedding(50258, 768)
In [31]: outputs = model.generate(input.input_ids, pad_token_id=tokenizer.eos_token_id, attention_mask=input.attention_mask, max_length=10)
In [32]: print(tokenizer.decode(outputs[0], skip_special_tokens=False))
Hello<|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|>
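A minimal diagnostic sketch, assuming the same distilgpt2 checkpoint and default greedy decoding (exact ranks may vary with the random initialization): inspect the next-token logits after resizing to see that the freshly added, randomly initialized pad row tends to take the top logit, which is why generation keeps emitting it.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
tokenizer.add_special_tokens({'pad_token': '<|PAD|>'})
model.resize_token_embeddings(len(tokenizer))  # new row is randomly initialized

inputs = tokenizer("Hello", return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the token after "Hello"
pad_id = tokenizer.pad_token_id
print(next_token_logits.argmax().item() == pad_id)  # often True: greedy decoding then emits <|PAD|>
print(next_token_logits.topk(5).indices)            # the new pad id typically ranks first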
Thank you for the reply.
It seems like something is missing. Even though resizing changes the distribution, why would the new token end up at the top? This doesn't happen with some other causal models (e.g. phi-1.5). What would be the recommended course of action for differentiating padding from eos? This could likely be documented, as it is a relatively common need.
GPT2 is a fairly old model. You can check that resizing to another size will also make it generate the new tokens. The recommended way is to initialize the new embeddings as described in https://nlp.stanford.edu/~johnhew/vocab-expansion.html, which also explains the failure.
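A minimal sketch of the initialization trick from the linked post, assuming distilgpt2's tied input/output embeddings: set the newly added row(s) to the mean of the pre-trained rows so the pad token no longer dominates the output distribution.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

num_old = model.get_input_embeddings().weight.shape[0]
tokenizer.add_special_tokens({'pad_token': '<|PAD|>'})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

with torch.no_grad():
    emb = model.get_input_embeddings().weight
    # initialize the new row(s) with the mean of the pre-trained rows;
    # distilgpt2 ties input and output embeddings, so the lm_head row is updated too
    emb[num_old:] = emb[:num_old].mean(dim=0, keepdim=True)

inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask,
                         pad_token_id=tokenizer.pad_token_id, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))  # should no longer be all <|PAD|>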
Thanks! So wouldn't this indicate that this is an improvement that should be integrated into the transformers library, or that a warning should be raised when embeddings are resized?
Yeah, I think we could help users with this. Either add this to the tokenizer when resizing, or to resize_token_embeddings. TBH I'd rather we properly document this once and for all in the resize_token_embeddings documentation rather than add a warning! Do you want to open a PR for that?
What are you suggesting: adding documentation to resize_token_embeddings in PreTrainedModel?
Yes ! 🤗
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Will look at it