RecursionError: maximum recursion depth exceeded while getting the str of an object.
System Info
- Python 3.8.10
- transformers 4.29.0.dev0
- sentencepiece 0.1.97
Information
- [x] The official example scripts
- [ ] My own modified scripts
Reproduction
In https://github.com/CarperAI/trlx/tree/main/examples, run:

```
python ppo_sentiments_llama.py
```
The recursive loop looks like this:
```
/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_fast.py:250 in convert_tokens_to_ids

   247             return None
   248
   249         if isinstance(tokens, str):
❱  250             return self._convert_token_to_id_with_added_voc(tokens)
   251
   252         ids = []
   253         for token in tokens:

/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_fast.py:260 in _convert_token_to_id_with_added_voc

   257     def _convert_token_to_id_with_added_voc(self, token: str) -> int:
   258         index = self._tokenizer.token_to_id(token)
   259         if index is None:
❱  260             return self.unk_token_id
   261         return index
   262
   263     def _convert_id_to_token(self, index: int) -> Optional[str]:

/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py:1141 in unk_token_id

  1138         """
  1139         if self._unk_token is None:
  1140             return None
❱ 1141         return self.convert_tokens_to_ids(self.unk_token)
  1142
  1143     @property
  1144     def sep_token_id(self) -> Optional[int]:

/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_fast.py:250 in convert_tokens_to_ids

   247             return None
   248
   249         if isinstance(tokens, str):
❱  250             return self._convert_token_to_id_with_added_voc(tokens)
   251
   252         ids = []
   253         for token in tokens:
```
... and so on, until finally:
```
/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py:1021 in unk_token

  1018             if self.verbose:
  1019                 logger.error("Using unk_token, but it is not set yet.")
  1020             return None
❱ 1021         return str(self._unk_token)
  1022
  1023     @property
  1024     def sep_token(self) -> str:

RecursionError: maximum recursion depth exceeded while getting the str of an object
```
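The same error can be triggered outside of trlx with just the tokenizer. A minimal sketch that isolates it, assuming the outdated "decapoda-research/llama-7b-hf" conversion is the tokenizer being loaded:

```python
from transformers import AutoTokenizer

# The old conversion stores special tokens that the fast tokenizer cannot map back
# to vocabulary ids, so unk_token_id -> convert_tokens_to_ids(unk_token)
# -> unk_token_id ... recurses until the limit is hit.
tokenizer = AutoTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
print(tokenizer.unk_token_id)  # RecursionError
```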
Expected behavior
Is the call expected to go to convert_tokens_to_ids in tokenization_utils.py (the slow tokenizer) instead of tokenization_utils_fast.py?
Thanks!
cc @ArthurZucker
Same problem, is there any progress?
Hey! The main issue is that the tokenizer files at "decapoda-research/llama-7b-hf" were never updated, while you are using the latest version of transformers. The tokenizer conversion was fixed and corrected, see #22402. Nothing we can do on our end...
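If you are unsure whether a given checkpoint is affected, a quick check (a sketch; the path is just a placeholder) is whether the stored special tokens resolve to real vocabulary ids on the underlying fast tokenizer:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/llama/checkpoint")  # placeholder path
for name in ("unk_token", "bos_token", "eos_token"):
    token = getattr(tok, name)
    # tok._tokenizer is the underlying `tokenizers` object of a fast tokenizer;
    # token_to_id(...) returning None for a special token is exactly what makes
    # unk_token_id recurse.
    print(name, repr(token), None if token is None else tok._tokenizer.token_to_id(token))
```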
@ArthurZucker I am facing a similar issue with openllama
```python
from transformers import AutoTokenizer

save_dir = "../open_llama_7b_preview_300bt/open_llama_7b_preview_300bt_transformers_weights/"
tokenizer = AutoTokenizer.from_pretrained(save_dir)
tokenizer.bos_token_id
```
Calling tokenizer.bos_token_id causes a max recursion depth error.
```
>>> tokenizer
LlamaTokenizerFast(name_or_path='../open_llama_7b_preview_300bt/open_llama_7b_preview_300bt_transformers_weights/', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True)}, clean_up_tokenization_spaces=False)
```
transformers version = 4.29.1
tokenizer_config.json:

```json
{
  "bos_token": "",
  "eos_token": "",
  "model_max_length": 1e+30,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": ""
}
```
Initializing as follows works, but I am not sure if this should be needed:
```python
tokenizer = AutoTokenizer.from_pretrained(
    save_dir, unk_token="<unk>", bos_token="<s>", eos_token="</s>"
)
```
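As a follow-up, saving the re-initialized tokenizer back should write the explicit special tokens into the checkpoint, so the extra keyword arguments are not needed on every load. A sketch of what I mean (not verified):

```python
from transformers import AutoTokenizer

save_dir = "../open_llama_7b_preview_300bt/open_llama_7b_preview_300bt_transformers_weights/"

# Re-load with explicit special tokens, then persist them so the corrected values
# end up in tokenizer_config.json / special_tokens_map.json.
tokenizer = AutoTokenizer.from_pretrained(
    save_dir, unk_token="<unk>", bos_token="<s>", eos_token="</s>"
)
tokenizer.save_pretrained(save_dir)
```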
So.... Again, if you are not using the latest / most recently converted tokenizer, I cannot help you. Check out huggyllama/llama-7b, which has a working tokenizer.
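For example, something like this should resolve the special tokens without any recursion (quick, untested sketch assuming a recent transformers release):

```python
from transformers import AutoTokenizer

# huggyllama/llama-7b ships a correctly converted tokenizer, so the special
# tokens map to real vocabulary ids and the properties below do not recurse.
tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
print(tok.unk_token, tok.bos_token, tok.eos_token)
print(tok.unk_token_id, tok.bos_token_id, tok.eos_token_id)
```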