
GPT2 doesn't generate new tokens if pad_token is added

Open antonioalegria opened this issue 1 year ago • 10 comments

System Info

  • transformers version: 4.38.2
  • Platform: macOS-14.3.1-arm64-arm-64bit
  • Python version: 3.11.6
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    • distributed_type: NO
    • mixed_precision: no
    • use_cpu: False
    • debug: False
    • num_processes: 1
    • machine_rank: 0
    • num_machines: 1
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env: []
    • dynamo_config: {'dynamo_backend': 'INDUCTOR'}
  • PyTorch version (GPU?): 2.2.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker (text model) and @gante (generation)

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

tokenizer.add_special_tokens({'pad_token': '<|PAD|>'})
model.resize_token_embeddings(len(tokenizer))

model.config.pad_token_id = tokenizer.pad_token_id

test = "Hello"

input = tokenizer(test, return_tensors="pt")

outputs = model.generate(input.input_ids, pad_token_id=tokenizer.pad_token_id, attention_mask=input.attention_mask, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False)) # => Hello<|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|>

If I, instead, set pad_token to be eos_token, it generates properly.

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

tokenizer.pad_token = tokenizer.eos_token

test = "Hello"

input = tokenizer(test, return_tensors="pt")

outputs = model.generate(input.input_ids, attention_mask=input.attention_mask, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False)) # Hello The first time I saw a new game,

Expected behavior

I would expect adding a padding token to not affect the model's ability to generate. It doesn't seem to affect other decoder-only models.

antonioalegria avatar Mar 27 '24 10:03 antonioalegria

Hey! That's somewhat expected. Resizing the embedding changes the distribution:

In [25]: model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

In [26]: model.pad_token_id = tokenizer.pad_token_id

In [27]: outputs = model.generate(input.input_ids, pad_token_id=tokenizer.eos_token_id, attention_mask=input.attention_mask, max_length=10)

In [28]: print(tokenizer.decode(outputs[0], skip_special_tokens=False))
Hello The first time I saw a new game,

In [29]: model.pad_token_id
Out[29]: 50257

In [30]: model.resize_token_embeddings(len(tokenizer))
Out[30]: Embedding(50258, 768)

In [31]: outputs = model.generate(input.input_ids, pad_token_id=tokenizer.eos_token_id, attention_mask=input.attention_mask, max_length=10)

In [32]: print(tokenizer.decode(outputs[0], skip_special_tokens=False))
Hello<|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|><|PAD|>
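
For example, you can see the shift directly in the next-token logits; a minimal sketch of that check (greedy argmax before and after resizing):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
tokenizer.add_special_tokens({'pad_token': '<|PAD|>'})

inputs = tokenizer("Hello", return_tensors="pt")

with torch.no_grad():
    # next-token logits with the original embedding matrix
    logits_before = model(**inputs).logits[0, -1]

model.resize_token_embeddings(len(tokenizer))

with torch.no_grad():
    # next-token logits after the resize adds a randomly initialized row
    logits_after = model(**inputs).logits[0, -1]

print(tokenizer.decode([logits_before.argmax().item()]))  # a regular token
print(tokenizer.decode([logits_after.argmax().item()]))   # often the new <|PAD|> token, as reported above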

ArthurZucker avatar Mar 28 '24 02:03 ArthurZucker

Thank you for the reply.

Seems like something is missing. Though it would change the distribution, why would the new token end up at the top? This doesn't happen with some other causal models (e.g. phi-1.5). What would be the recommended course of action to differentiate padding from eos? This could likely be documented, as it is a relatively common need.

antonioalegria avatar Mar 28 '24 05:03 antonioalegria

GPT2 is a fairly old model. You can check that resizing to other sizes will also make it generate the new tokens. The recommended way is to initialize the new embeddings as described in https://nlp.stanford.edu/~johnhew/vocab-expansion.html, which also explains the failure.
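
Concretely, the idea from that post is to initialize the new embedding rows to the mean of the pre-existing rows instead of leaving them randomly initialized. A minimal sketch of doing that by hand after resizing (assuming only the pad token was added):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

num_new_tokens = tokenizer.add_special_tokens({'pad_token': '<|PAD|>'})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

with torch.no_grad():
    input_embeddings = model.get_input_embeddings().weight
    output_embeddings = model.get_output_embeddings().weight

    # mean of the original (pre-resize) rows only
    input_mean = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
    output_mean = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

    # overwrite the randomly initialized new rows with that mean
    input_embeddings[-num_new_tokens:] = input_mean
    output_embeddings[-num_new_tokens:] = output_mean

With GPT2 the input and output embeddings are tied, so the two assignments write to the same underlying weights; keeping both lines makes the sketch also cover models where they are not tied.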

ArthurZucker avatar Mar 28 '24 13:03 ArthurZucker

Thanks! So wouldn't this indicate that this improvement should be integrated into the transformers library, or that a warning should be raised when embeddings are resized?

antonioalegria avatar Mar 29 '24 06:03 antonioalegria

Yeah, I think we could help users with this. We could either add this when the tokenizer is resized or in resize_token_embeddings. TBH I'd rather we properly document this once and for all in the resize_token_embeddings documentation rather than raise a warning! Do you want to open a PR for that?

ArthurZucker avatar Mar 30 '24 07:03 ArthurZucker

What are you suggesting: adding documentation to resize_token_embeddings in PreTrainedModel?

antonioalegria avatar Apr 01 '24 15:04 antonioalegria

Yes! 🤗

ArthurZucker avatar Apr 02 '24 08:04 ArthurZucker

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 27 '24 08:04 github-actions[bot]

Will look at it

antonioalegria avatar Apr 29 '24 08:04 antonioalegria

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 24 '24 08:05 github-actions[bot]