
model_max_length arg has no effect when creating BERT tokenizer

Open · galtay opened this issue 1 year ago

System Info

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


  • transformers version: 4.37.2
  • Platform: macOS-14.2.1-arm64-arm-64bit
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): not installed (NA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
new_tokenizer = AutoTokenizer.from_pretrained('google-bert/bert-base-uncased', model_max_length=8192)
print(new_tokenizer.model_max_length)
# 8192
old_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=8192)
print(old_tokenizer.model_max_length)
# 512

Expected behavior

print(old_tokenizer.model_max_length)
# 8192
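The symptom (a user-passed kwarg silently losing to a value from the hub's tokenizer config) is consistent with a kwarg-merge-order bug. The sketch below is a hypothetical illustration of that class of bug, not the actual transformers code path; `buggy_merge` and `fixed_merge` are invented names:

```python
# Hypothetical illustration: if defaults read from the hub config are
# merged AFTER the caller's kwargs, the config value silently wins.
def buggy_merge(user_kwargs, config_defaults):
    # wrong order: config defaults overwrite user kwargs
    return {**user_kwargs, **config_defaults}

def fixed_merge(user_kwargs, config_defaults):
    # right order: user kwargs take precedence over config defaults
    return {**config_defaults, **user_kwargs}

user = {"model_max_length": 8192}
config = {"model_max_length": 512}

print(buggy_merge(user, config)["model_max_length"])  # 512 (the reported bug)
print(fixed_merge(user, config)["model_max_length"])  # 8192 (expected)
```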

galtay avatar Feb 16 '24 06:02 galtay

Hi @galtay, thanks for raising this issue!

It looks related to #29050

cc @LysandreJik

amyeroberts avatar Feb 16 '24 12:02 amyeroberts

In [7]: transformers.__version__
Out[7]: '4.39.0.dev0'

In [3]: nt = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased", model_max_length=8192)
In [4]: ot = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=8192)

In [5]: nt.model_max_length
Out[5]: 512

In [6]: ot.model_max_length
Out[6]: 8192

galtay avatar Mar 17 '24 16:03 galtay

Gentle ping @LysandreJik @ArthurZucker

amyeroberts avatar Apr 10 '24 13:04 amyeroberts

This is now fixed on main! It took a bit of time to go through the deprecation cycle, but it's live.

Thanks for the report @galtay!

LysandreJik avatar May 09 '24 15:05 LysandreJik