
Llama fast tokenizer `train_new_from_iterator` returns `TypeError: 'NoneType' object is not subscriptable`

Open larrylawl opened this issue 2 years ago • 2 comments

System Info

accelerate==0.18.0 aiohttp==3.8.4 aiosignal==1.3.1 anyio==3.6.2 argon2-cffi==21.3.0 argon2-cffi-bindings==21.2.0 arrow==1.2.3 asttokens==2.2.1 async-timeout==4.0.2 attrs==23.1.0 backcall==0.2.0 beautifulsoup4==4.12.2 bitsandbytes==0.38.1 bleach==6.0.0 certifi==2022.12.7 cffi==1.15.1 charset-normalizer==3.1.0 cmake==3.26.3 comm==0.1.3 datasets==2.11.0 debugpy==1.6.7 decorator==5.1.1 defusedxml==0.7.1 dill==0.3.6 evaluate==0.4.0 executing==1.2.0 fastjsonschema==2.16.3 filelock==3.12.0 fqdn==1.5.1 frozenlist==1.3.3 fsspec==2023.4.0 huggingface-hub==0.13.4 idna==3.4 importlib-metadata==6.5.0 importlib-resources==5.12.0 ipykernel==6.22.0 ipython==8.12.0 ipython-genutils==0.2.0 isoduration==20.11.0 jedi==0.18.2 Jinja2==3.1.2 jsonpointer==2.3 jsonschema==4.17.3 jupyter-events==0.6.3 jupyter_client==8.2.0 jupyter_core==5.3.0 jupyter_server==2.5.0 jupyter_server_terminals==0.4.4 jupyterlab-pygments==0.2.2 lit==16.0.1 MarkupSafe==2.1.2 matplotlib-inline==0.1.6 mistune==2.0.5 mpmath==1.3.0 multidict==6.0.4 multiprocess==0.70.14 nbclassic==0.5.5 nbclient==0.7.3 nbconvert==7.3.1 nbformat==5.8.0 nest-asyncio==1.5.6 networkx==3.1 notebook==6.5.4 notebook_shim==0.2.2 numpy==1.24.2 nvidia-cublas-cu11==11.10.3.66 nvidia-cuda-cupti-cu11==11.7.101 nvidia-cuda-nvrtc-cu11==11.7.99 nvidia-cuda-runtime-cu11==11.7.99 nvidia-cudnn-cu11==8.5.0.96 nvidia-cufft-cu11==10.9.0.58 nvidia-curand-cu11==10.2.10.91 nvidia-cusolver-cu11==11.4.0.1 nvidia-cusparse-cu11==11.7.4.91 nvidia-nccl-cu11==2.14.3 nvidia-nvtx-cu11==11.7.91 packaging==23.1 pandas==2.0.0 pandocfilters==1.5.0 parso==0.8.3 pexpect==4.8.0 pickleshare==0.7.5 pkgutil_resolve_name==1.3.10 platformdirs==3.2.0 prometheus-client==0.16.0 prompt-toolkit==3.0.38 protobuf==3.20.0 psutil==5.9.5 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==11.0.0 pycparser==2.21 Pygments==2.15.1 pyrsistent==0.19.3 python-dateutil==2.8.2 python-dotenv==1.0.0 python-json-logger==2.0.7 pytz==2023.3 PyYAML==6.0 pyzmq==25.0.2 regex==2023.3.23 requests==2.28.2 responses==0.18.0 rfc3339-validator==0.1.4 rfc3986-validator==0.1.1 Send2Trash==1.8.0 sentencepiece==0.1.98 six==1.16.0 sniffio==1.3.0 soupsieve==2.4.1 stack-data==0.6.2 sympy==1.11.1 terminado==0.17.1 tinycss2==1.2.1 tokenizers==0.13.3 torch==2.0.0 tornado==6.3 tqdm==4.65.0 traitlets==5.9.0 -e git+https://github.com/huggingface/transformers.git@474bf508dfe0d46fc38585a1bb793e5ba74fddfd#egg=transformers triton==2.0.0 typing_extensions==4.5.0 tzdata==2023.3 uri-template==1.2.0 urllib3==1.26.15 wcwidth==0.2.6 webcolors==1.13 webencodings==0.5.1 websocket-client==1.5.1 xxhash==3.2.0 yarl==1.8.2 zipp==3.15.0

Who can help?

@ArthurZucker, @Narsil

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. Convert the Llama weights to HF format.
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size tokenizer_only --output_dir /output/path
  2. Train a new tokenizer from the old one.
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("/output/path")
old_tokenizer.train_new_from_iterator(["I love huggingface!"], 50)

Expected behavior
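train_new_from_iterator should return a newly trained tokenizer, as it does for checkpoints whose tokenizer.json defines a pre_tokenizer. A minimal sketch for contrast (the gpt2 checkpoint and the vocab size here are illustrative, not part of the original report):

from transformers import AutoTokenizer

# gpt2's tokenizer.json defines a ByteLevel pre_tokenizer, so the check that
# fails below for the converted Llama tokenizer passes here and training runs.
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = gpt2_tokenizer.train_new_from_iterator(["I love huggingface!"], vocab_size=1000)
print(new_tokenizer)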

Actual behavior

Instead of returning a trained tokenizer, the call raised the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 5
      3 old_tokenizer = AutoTokenizer.from_pretrained(PATH_TO_LLAMA_DIR,)
----> 5 old_tokenizer.train_new_from_iterator(["I love huggingface!"], 50)

File ~/transformers/src/transformers/tokenization_utils_fast.py:709, in PreTrainedTokenizerFast.train_new_from_iterator(self, text_iterator, vocab_size, length, new_special_tokens, special_tokens_map, **kwargs)
    707 if tokenizer_json["model"]["type"] == "Unigram" and unk_token is not None:
    708     kwargs["unk_token"] = unk_token
--> 709 if tokenizer_json["pre_tokenizer"]["type"] == "ByteLevel":
    710     kwargs["initial_alphabet"] = pre_tokenizers_fast.ByteLevel.alphabet()
    712 trainer_class = MODEL_TO_TRAINER_MAPPING[tokenizer_json["model"]["type"]]

TypeError: 'NoneType' object is not subscriptable

Analysis

Inspecting my tokenizer.json file (tokenizer.zip), I realised that my "pre_tokenizer" is null, which is what triggers the error.

I'm not sure if it helps, but I had an issue converting the Llama weights to HF format (step 1) due to the protobuf version bug described here. I fixed it by downgrading protobuf to version 3.20.
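To make the failure concrete, here is a minimal sketch that serializes the tokenizer and prints the field indexed at line 709; the guard at the end is illustrative only and not necessarily the change made upstream (the path is the placeholder output directory from step 1):

import json
from transformers import AutoTokenizer

# Load the converted Llama tokenizer and serialize its backing `tokenizers` object,
# the same JSON that train_new_from_iterator inspects.
old_tokenizer = AutoTokenizer.from_pretrained("/output/path")
tokenizer_json = json.loads(old_tokenizer.backend_tokenizer.to_str())

print(tokenizer_json["pre_tokenizer"])  # None, so indexing ["type"] raises TypeError

# A guard along these lines avoids the crash for tokenizers without a pre_tokenizer:
pre_tokenizer = tokenizer_json.get("pre_tokenizer")
if pre_tokenizer is not None and pre_tokenizer["type"] == "ByteLevel":
    pass  # would add the ByteLevel initial alphabet here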

larrylawl avatar Apr 20 '23 02:04 larrylawl

Same problem here. The code looks for a ByteLevel pre_tokenizer, but the json.loads of the serialized tokenizer at line 644 of tokenization_utils_fast.py yields a dict whose pre_tokenizer is None.

giacoballoccu avatar Apr 22 '23 10:04 giacoballoccu

Hey! Thanks for reporting! I can reproduce this; it is indeed a bug, I will investigate.

ArthurZucker avatar Apr 24 '23 12:04 ArthurZucker

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 20 '23 15:05 github-actions[bot]

This should have been fixed by #22959.

ArthurZucker avatar May 23 '23 09:05 ArthurZucker