Llama fast tokenizer `train_new_from_iterator` returns `TypeError: 'NoneType' object is not subscriptable`
System Info
accelerate==0.18.0 aiohttp==3.8.4 aiosignal==1.3.1 anyio==3.6.2 argon2-cffi==21.3.0 argon2-cffi-bindings==21.2.0 arrow==1.2.3 asttokens==2.2.1 async-timeout==4.0.2 attrs==23.1.0 backcall==0.2.0 beautifulsoup4==4.12.2 bitsandbytes==0.38.1 bleach==6.0.0 certifi==2022.12.7 cffi==1.15.1 charset-normalizer==3.1.0 cmake==3.26.3 comm==0.1.3 datasets==2.11.0 debugpy==1.6.7 decorator==5.1.1 defusedxml==0.7.1 dill==0.3.6 evaluate==0.4.0 executing==1.2.0 fastjsonschema==2.16.3 filelock==3.12.0 fqdn==1.5.1 frozenlist==1.3.3 fsspec==2023.4.0 huggingface-hub==0.13.4 idna==3.4 importlib-metadata==6.5.0 importlib-resources==5.12.0 ipykernel==6.22.0 ipython==8.12.0 ipython-genutils==0.2.0 isoduration==20.11.0 jedi==0.18.2 Jinja2==3.1.2 jsonpointer==2.3 jsonschema==4.17.3 jupyter-events==0.6.3 jupyter_client==8.2.0 jupyter_core==5.3.0 jupyter_server==2.5.0 jupyter_server_terminals==0.4.4 jupyterlab-pygments==0.2.2 lit==16.0.1 MarkupSafe==2.1.2 matplotlib-inline==0.1.6 mistune==2.0.5 mpmath==1.3.0 multidict==6.0.4 multiprocess==0.70.14 nbclassic==0.5.5 nbclient==0.7.3 nbconvert==7.3.1 nbformat==5.8.0 nest-asyncio==1.5.6 networkx==3.1 notebook==6.5.4 notebook_shim==0.2.2 numpy==1.24.2 nvidia-cublas-cu11==11.10.3.66 nvidia-cuda-cupti-cu11==11.7.101 nvidia-cuda-nvrtc-cu11==11.7.99 nvidia-cuda-runtime-cu11==11.7.99 nvidia-cudnn-cu11==8.5.0.96 nvidia-cufft-cu11==10.9.0.58 nvidia-curand-cu11==10.2.10.91 nvidia-cusolver-cu11==11.4.0.1 nvidia-cusparse-cu11==11.7.4.91 nvidia-nccl-cu11==2.14.3 nvidia-nvtx-cu11==11.7.91 packaging==23.1 pandas==2.0.0 pandocfilters==1.5.0 parso==0.8.3 pexpect==4.8.0 pickleshare==0.7.5 pkgutil_resolve_name==1.3.10 platformdirs==3.2.0 prometheus-client==0.16.0 prompt-toolkit==3.0.38 protobuf==3.20.0 psutil==5.9.5 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==11.0.0 pycparser==2.21 Pygments==2.15.1 pyrsistent==0.19.3 python-dateutil==2.8.2 python-dotenv==1.0.0 python-json-logger==2.0.7 pytz==2023.3 PyYAML==6.0 pyzmq==25.0.2 regex==2023.3.23 requests==2.28.2 responses==0.18.0 rfc3339-validator==0.1.4 rfc3986-validator==0.1.1 Send2Trash==1.8.0 sentencepiece==0.1.98 six==1.16.0 sniffio==1.3.0 soupsieve==2.4.1 stack-data==0.6.2 sympy==1.11.1 terminado==0.17.1 tinycss2==1.2.1 tokenizers==0.13.3 torch==2.0.0 tornado==6.3 tqdm==4.65.0 traitlets==5.9.0 -e git+https://github.com/huggingface/transformers.git@474bf508dfe0d46fc38585a1bb793e5ba74fddfd#egg=transformers triton==2.0.0 typing_extensions==4.5.0 tzdata==2023.3 uri-template==1.2.0 urllib3==1.26.15 wcwidth==0.2.6 webcolors==1.13 webencodings==0.5.1 websocket-client==1.5.1 xxhash==3.2.0 yarl==1.8.2 zipp==3.15.0
Who can help?
@ArthurZucker , @Narsil
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Convert the Llama weights to HF format:

```bash
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size tokenizer_only --output_dir /output/path
```
- Train a new tokenizer from the old one:

```python
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("/output/path")
old_tokenizer.train_new_from_iterator(["I love huggingface!"], 50)
```
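For comparison, the same call goes through on a tokenizer whose serialized JSON does define a pre-tokenizer, e.g. GPT-2's ByteLevel one (a minimal sketch; the model choice and vocab size here are arbitrary):

```python
from transformers import AutoTokenizer

# gpt2's fast tokenizer serializes with a non-null "pre_tokenizer" of type
# "ByteLevel", so train_new_from_iterator does not hit the NoneType subscripting.
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = gpt2_tokenizer.train_new_from_iterator(["I love huggingface!"], 1000)
```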
Expected behavior
`train_new_from_iterator` should train and return a new tokenizer. Instead, I ran into the following error:
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 5
      3 old_tokenizer = AutoTokenizer.from_pretrained(PATH_TO_LLAMA_DIR,)
----> 5 old_tokenizer.train_new_from_iterator(["I love huggingface!"], 50)

File ~/transformers/src/transformers/tokenization_utils_fast.py:709, in PreTrainedTokenizerFast.train_new_from_iterator(self, text_iterator, vocab_size, length, new_special_tokens, special_tokens_map, **kwargs)
    707 if tokenizer_json["model"]["type"] == "Unigram" and unk_token is not None:
    708     kwargs["unk_token"] = unk_token
--> 709 if tokenizer_json["pre_tokenizer"]["type"] == "ByteLevel":
    710     kwargs["initial_alphabet"] = pre_tokenizers_fast.ByteLevel.alphabet()
    712 trainer_class = MODEL_TO_TRAINER_MAPPING[tokenizer_json["model"]["type"]]

TypeError: 'NoneType' object is not subscriptable
```
Analysis
Inspecting my tokenizer.json file (tokenizer.zip), I realised that it has `"pre_tokenizer": null`, which leads to the error above.
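This can be checked programmatically as well; a minimal sketch, assuming the converted model lives at `/output/path` as in the reproduction above:

```python
import json

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/output/path")
# Serialize the backend tokenizers.Tokenizer and parse it back into a dict.
tokenizer_json = json.loads(tok.backend_tokenizer.to_str())
print(tokenizer_json["pre_tokenizer"])  # prints None for this converted Llama tokenizer
```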
I'm not sure if it helps, but I had an issue converting the Llama weights to HF format (step 1) due to the protobuf version bug described here. I fixed it by downgrading protobuf to version 3.20.
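For reference, the downgrade was just `pip install protobuf==3.20.0` (the version pinned in the environment listing above).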
Same problem here. The code appears to be looking for a ByteLevel pre-tokenizer, but the `json.loads` of `_tokenizer` at line 644 of tokenization_utils_fast.py initializes `tokenizer_json` with a `pre_tokenizer` equal to `None`.
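A minimal sketch of a None-safe version of the failing check, reusing the names from the traceback above (an illustration only, not necessarily what the eventual fix looks like):

```python
# None-safe replacement for the unguarded subscripting at
# tokenization_utils_fast.py:709 shown in the traceback:
pre_tokenizer = tokenizer_json.get("pre_tokenizer")
if pre_tokenizer is not None and pre_tokenizer["type"] == "ByteLevel":
    kwargs["initial_alphabet"] = pre_tokenizers_fast.ByteLevel.alphabet()
```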
Hey! Thanks for reporting! I can reproduce this; it is indeed a bug, I will investigate.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Should have been fixed by #22959