Llama fast tokenizer `train_new_from_iterator` returns `TypeError: 'NoneType' object is not subscriptable`
System Info
accelerate==0.18.0 aiohttp==3.8.4 aiosignal==1.3.1 anyio==3.6.2 argon2-cffi==21.3.0 argon2-cffi-bindings==21.2.0 arrow==1.2.3 asttokens==2.2.1 async-timeout==4.0.2 attrs==23.1.0 backcall==0.2.0 beautifulsoup4==4.12.2 bitsandbytes==0.38.1 bleach==6.0.0 certifi==2022.12.7 cffi==1.15.1 charset-normalizer==3.1.0 cmake==3.26.3 comm==0.1.3 datasets==2.11.0 debugpy==1.6.7 decorator==5.1.1 defusedxml==0.7.1 dill==0.3.6 evaluate==0.4.0 executing==1.2.0 fastjsonschema==2.16.3 filelock==3.12.0 fqdn==1.5.1 frozenlist==1.3.3 fsspec==2023.4.0 huggingface-hub==0.13.4 idna==3.4 importlib-metadata==6.5.0 importlib-resources==5.12.0 ipykernel==6.22.0 ipython==8.12.0 ipython-genutils==0.2.0 isoduration==20.11.0 jedi==0.18.2 Jinja2==3.1.2 jsonpointer==2.3 jsonschema==4.17.3 jupyter-events==0.6.3 jupyter_client==8.2.0 jupyter_core==5.3.0 jupyter_server==2.5.0 jupyter_server_terminals==0.4.4 jupyterlab-pygments==0.2.2 lit==16.0.1 MarkupSafe==2.1.2 matplotlib-inline==0.1.6 mistune==2.0.5 mpmath==1.3.0 multidict==6.0.4 multiprocess==0.70.14 nbclassic==0.5.5 nbclient==0.7.3 nbconvert==7.3.1 nbformat==5.8.0 nest-asyncio==1.5.6 networkx==3.1 notebook==6.5.4 notebook_shim==0.2.2 numpy==1.24.2 nvidia-cublas-cu11==11.10.3.66 nvidia-cuda-cupti-cu11==11.7.101 nvidia-cuda-nvrtc-cu11==11.7.99 nvidia-cuda-runtime-cu11==11.7.99 nvidia-cudnn-cu11==8.5.0.96 nvidia-cufft-cu11==10.9.0.58 nvidia-curand-cu11==10.2.10.91 nvidia-cusolver-cu11==11.4.0.1 nvidia-cusparse-cu11==11.7.4.91 nvidia-nccl-cu11==2.14.3 nvidia-nvtx-cu11==11.7.91 packaging==23.1 pandas==2.0.0 pandocfilters==1.5.0 parso==0.8.3 pexpect==4.8.0 pickleshare==0.7.5 pkgutil_resolve_name==1.3.10 platformdirs==3.2.0 prometheus-client==0.16.0 prompt-toolkit==3.0.38 protobuf==3.20.0 psutil==5.9.5 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==11.0.0 pycparser==2.21 Pygments==2.15.1 pyrsistent==0.19.3 python-dateutil==2.8.2 python-dotenv==1.0.0 python-json-logger==2.0.7 pytz==2023.3 PyYAML==6.0 pyzmq==25.0.2 regex==2023.3.23 requests==2.28.2 responses==0.18.0 rfc3339-validator==0.1.4 rfc3986-validator==0.1.1 Send2Trash==1.8.0 sentencepiece==0.1.98 six==1.16.0 sniffio==1.3.0 soupsieve==2.4.1 stack-data==0.6.2 sympy==1.11.1 terminado==0.17.1 tinycss2==1.2.1 tokenizers==0.13.3 torch==2.0.0 tornado==6.3 tqdm==4.65.0 traitlets==5.9.0 -e git+https://github.com/huggingface/transformers.git@474bf508dfe0d46fc38585a1bb793e5ba74fddfd#egg=transformers triton==2.0.0 typing_extensions==4.5.0 tzdata==2023.3 uri-template==1.2.0 urllib3==1.26.15 wcwidth==0.2.6 webcolors==1.13 webencodings==0.5.1 websocket-client==1.5.1 xxhash==3.2.0 yarl==1.8.2 zipp==3.15.0
Who can help?
@ArthurZucker , @Narsil
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Convert the Llama weights to HF format:

```bash
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size tokenizer_only --output_dir /output/path
```
- Train a new tokenizer from the old one:

```python
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("/output/path")
old_tokenizer.train_new_from_iterator(["I love huggingface!"], 50)
```
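For comparison, the same call goes through on a tokenizer whose serialized JSON does define a pre-tokenizer, e.g. GPT-2's ByteLevel one (a minimal sketch; the model choice and vocab size here are arbitrary):

```python
from transformers import AutoTokenizer

# gpt2's fast tokenizer serializes with a non-null "pre_tokenizer" of type
# "ByteLevel", so train_new_from_iterator does not hit the NoneType subscripting.
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = gpt2_tokenizer.train_new_from_iterator(["I love huggingface!"], 1000)
```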
Expected behavior
`train_new_from_iterator` should train and return a new tokenizer. Instead, I ran into the following error:
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 5
      3 old_tokenizer = AutoTokenizer.from_pretrained(PATH_TO_LLAMA_DIR,)
----> 5 old_tokenizer.train_new_from_iterator(["I love huggingface!"], 50)

File ~/transformers/src/transformers/tokenization_utils_fast.py:709, in PreTrainedTokenizerFast.train_new_from_iterator(self, text_iterator, vocab_size, length, new_special_tokens, special_tokens_map, **kwargs)
    707 if tokenizer_json["model"]["type"] == "Unigram" and unk_token is not None:
    708     kwargs["unk_token"] = unk_token
--> 709 if tokenizer_json["pre_tokenizer"]["type"] == "ByteLevel":
    710     kwargs["initial_alphabet"] = pre_tokenizers_fast.ByteLevel.alphabet()
    712 trainer_class = MODEL_TO_TRAINER_MAPPING[tokenizer_json["model"]["type"]]

TypeError: 'NoneType' object is not subscriptable
```
Analysis
Inspecting my tokenizer.json file (tokenizer.zip), I realised that it has `"pre_tokenizer": null`, which leads to the error above.
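This can be checked programmatically as well; a minimal sketch, assuming the converted model lives at `/output/path` as in the reproduction above:

```python
import json

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/output/path")
# Serialize the backend tokenizers.Tokenizer and parse it back into a dict.
tokenizer_json = json.loads(tok.backend_tokenizer.to_str())
print(tokenizer_json["pre_tokenizer"])  # prints None for this converted Llama tokenizer
```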
I'm not sure if it helps, but I had an issue converting the Llama weights to HF format (step 1) due to the protobuf version bug described here. I fixed it by downgrading protobuf to version 3.20.
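For reference, the downgrade was just `pip install protobuf==3.20.0` (the version pinned in the environment listing above).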
Same problem here. The code appears to be looking for a ByteLevel pre-tokenizer, but the `json.loads` of `_tokenizer` at line 644 of tokenization_utils_fast.py initializes `tokenizer_json` with a `pre_tokenizer` equal to `None`.
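A minimal sketch of a None-safe version of the failing check, reusing the names from the traceback above (an illustration only, not necessarily what the eventual fix looks like):

```python
# None-safe replacement for the unguarded subscripting at
# tokenization_utils_fast.py:709 shown in the traceback:
pre_tokenizer = tokenizer_json.get("pre_tokenizer")
if pre_tokenizer is not None and pre_tokenizer["type"] == "ByteLevel":
    kwargs["initial_alphabet"] = pre_tokenizers_fast.ByteLevel.alphabet()
```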
Hey! Thanks for reporting! I can reproduce this; it is indeed a bug, I will investigate.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Should have been fixed by #22959