
RecursionError: maximum recursion depth exceeded while getting the str of an object.

Open EZlzh opened this issue 2 years ago • 3 comments

System Info

  • Python 3.8.10
  • transformers 4.29.0.dev0
  • sentencepiece 0.1.97

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

Reproduction: Run `python ppo_sentiments_llama.py` from https://github.com/CarperAI/trlx/tree/main/examples. The recursive loop occurs as follows:

/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_fast.py:250 in convert_tokens_to_ids

    247 │             return None
    248 │
    249 │         if isinstance(tokens, str):
❱   250 │             return self._convert_token_to_id_with_added_voc(tokens)
    251 │
    252 │         ids = []
    253 │         for token in tokens:

/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_fast.py:260 in _convert_token_to_id_with_added_voc

    257 │     def _convert_token_to_id_with_added_voc(self, token: str) -> int:
    258 │         index = self._tokenizer.token_to_id(token)
    259 │         if index is None:
❱   260 │             return self.unk_token_id
    261 │         return index
    262 │
    263 │     def _convert_id_to_token(self, index: int) -> Optional[str]:

/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py:1141 in unk_token_id

   1138 │         """
   1139 │         if self._unk_token is None:
   1140 │             return None
❱  1141 │         return self.convert_tokens_to_ids(self.unk_token)
   1142 │
   1143 │     @property
   1144 │     def sep_token_id(self) -> Optional[int]:

/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_fast.py:250 in convert_tokens_to_ids

    247 │             return None
    248 │
    249 │         if isinstance(tokens, str):
❱   250 │             return self._convert_token_to_id_with_added_voc(tokens)
    251 │
    252 │         ids = []
    253 │         for token in tokens:

... until:

/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py:1021 in unk_token

   1018 │             if self.verbose:
   1019 │                 logger.error("Using unk_token, but it is not set yet.")
   1020 │             return None
❱  1021 │         return str(self._unk_token)
   1022 │
   1023 │     @property
   1024 │     def sep_token(self) -> str:

RecursionError: maximum recursion depth exceeded while getting the str of an object
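
The cycle is visible in the frames above: _convert_token_to_id_with_added_voc falls back to unk_token_id, and unk_token_id resolves the unk token string through convert_tokens_to_ids again. Below is a minimal toy sketch of that loop, not the actual transformers implementation; ToyFastTokenizer and its vocabulary are made up for illustration.

from typing import Optional

# Toy stand-in for the two methods in the traceback above. If the unk token
# string is not in the backend vocabulary (e.g. it is "" in a broken config),
# the two lookups bounce between each other with no base case.
class ToyFastTokenizer:
    def __init__(self, vocab: dict, unk_token: str):
        self.vocab = vocab          # token string -> id
        self.unk_token = unk_token

    def convert_tokens_to_ids(self, token: str) -> Optional[int]:
        return self._convert_token_to_id_with_added_voc(token)

    def _convert_token_to_id_with_added_voc(self, token: str) -> Optional[int]:
        index = self.vocab.get(token)   # stands in for self._tokenizer.token_to_id
        if index is None:
            return self.unk_token_id    # falls back to the unk id ...
        return index

    @property
    def unk_token_id(self) -> Optional[int]:
        # ... which resolves the unk *string* through the same lookup again.
        return self.convert_tokens_to_ids(self.unk_token)

tok = ToyFastTokenizer(vocab={"<s>": 1}, unk_token="")
try:
    tok.convert_tokens_to_ids("not-in-vocab")
except RecursionError as err:
    print(err)  # maximum recursion depth exceeded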

Expected behavior: Is the code expected to call convert_tokens_to_ids from tokenization_utils.py instead of the one in tokenization_utils_fast.py?

Thanks!

EZlzh avatar Apr 14 '23 08:04 EZlzh

cc @ArthurZucker

amyeroberts avatar Apr 14 '23 13:04 amyeroberts

Same problem here. Is there any progress?

c-box avatar Apr 18 '23 09:04 c-box

Hey! The main issue is that the tokenizer files at "decapoda-research/llama-7b-hf" were never updated, while you are using the latest version of transformers. The tokenizer was fixed and corrected, see #22402. Nothing we can do on our end...
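
For anyone hitting this, a hedged diagnostic sketch (assuming the checkpoint still loads at all on your transformers version): check whether each special-token string resolves in the backend vocabulary, since token_to_id returning None for the unk token is exactly what starts the loop in the traceback.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
for name in ("unk_token", "bos_token", "eos_token"):
    token = getattr(tokenizer, name)
    if token is not None:
        # None printed here means the string is missing from the vocabulary,
        # the condition that triggers the unk_token_id recursion.
        print(name, repr(token), tokenizer._tokenizer.token_to_id(token))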

ArthurZucker avatar Apr 24 '23 14:04 ArthurZucker

@ArthurZucker I am facing a similar issue with OpenLLaMA:

from transformers import AutoTokenizer

save_dir = "../open_llama_7b_preview_300bt/open_llama_7b_preview_300bt_transformers_weights/"
tokenizer = AutoTokenizer.from_pretrained(save_dir)
tokenizer.bos_token_id  # hits the recursion

Calling tokenizer.bos_token_id causes the max recursion depth error.

tokenizer
LlamaTokenizerFast(name_or_path='../open_llama_7b_preview_300bt/open_llama_7b_preview_300bt_transformers_weights/', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True)}, clean_up_tokenization_spaces=False)

transformers version = 4.29.1

tokenizer_config.json

{
  "bos_token": "",
  "eos_token": "",
  "model_max_length": 1e+30,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": ""
}

Initializing as follows works, but I am not sure if this should be needed:

tokenizer = AutoTokenizer.from_pretrained(
    save_dir,
    unk_token="<unk>",
    bos_token="<s>",
    eos_token="</s>",
)
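
If that override is the right fix, a hedged follow-up sketch: saving the tokenizer back should persist the corrected tokens, so they do not need to be passed on every load.

# Rewrites tokenizer_config.json / special_tokens_map.json in save_dir, so
# future AutoTokenizer.from_pretrained(save_dir) calls pick up the fix.
tokenizer.save_pretrained(save_dir)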

KeremTurgutlu avatar May 14 '23 00:05 KeremTurgutlu

So... again, if you are not using the latest / most recently converted tokenizer, I cannot help you. Check out huggyllama/llama-7b, which has a working tokenizer.
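
As a sanity check, the working checkpoint resolves its special tokens without recursing (a hedged sketch; <unk>=0, <s>=1, </s>=2 are the ids of the standard LLaMA vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
print(tokenizer.unk_token, tokenizer.unk_token_id)  # <unk> 0
print(tokenizer.bos_token, tokenizer.bos_token_id)  # <s> 1
print(tokenizer.eos_token, tokenizer.eos_token_id)  # </s> 2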

ArthurZucker avatar May 26 '23 09:05 ArthurZucker