RecursionError: maximum recursion depth exceeded while getting the str of an object.
System Info
- Python 3.8.10
- transformers 4.29.0.dev0
- sentencepiece 0.1.97
Information
- [x] The official example scripts
- [ ] My own modified scripts
Reproduction
In https://github.com/CarperAI/trlx/tree/main/examples, run:

```
python ppo_sentiments_llama.py
```
The recursive loop looks like this:
```
/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_fast.py:250 in convert_tokens_to_ids

   247             return None
   248
   249         if isinstance(tokens, str):
❱  250             return self._convert_token_to_id_with_added_voc(tokens)
   251
   252         ids = []
   253         for token in tokens:

/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_fast.py:260 in _convert_token_to_id_with_added_voc

   257     def _convert_token_to_id_with_added_voc(self, token: str) -> int:
   258         index = self._tokenizer.token_to_id(token)
   259         if index is None:
❱  260             return self.unk_token_id
   261         return index
   262
   263     def _convert_id_to_token(self, index: int) -> Optional[str]:

/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py:1141 in unk_token_id

  1138         """
  1139         if self._unk_token is None:
  1140             return None
❱ 1141         return self.convert_tokens_to_ids(self.unk_token)
  1142
  1143     @property
  1144     def sep_token_id(self) -> Optional[int]:

/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_fast.py:250 in convert_tokens_to_ids

   247             return None
   248
   249         if isinstance(tokens, str):
❱  250             return self._convert_token_to_id_with_added_voc(tokens)
   251
   252         ids = []
   253         for token in tokens:
```
... and so on, until finally:
```
/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py:1021 in unk_token

  1018             if self.verbose:
  1019                 logger.error("Using unk_token, but it is not set yet.")
  1020             return None
❱ 1021         return str(self._unk_token)
  1022
  1023     @property
  1024     def sep_token(self) -> str:

RecursionError: maximum recursion depth exceeded while getting the str of an object
```
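The same error can be triggered outside of trlx with just the tokenizer. A minimal sketch that isolates it, assuming the outdated "decapoda-research/llama-7b-hf" conversion is the tokenizer being loaded:

```python
from transformers import AutoTokenizer

# The old conversion stores special tokens that the fast tokenizer cannot map back
# to vocabulary ids, so unk_token_id -> convert_tokens_to_ids(unk_token)
# -> unk_token_id ... recurses until the limit is hit.
tokenizer = AutoTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
print(tokenizer.unk_token_id)  # RecursionError
```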
Expected behavior
Is the call expected to go to convert_tokens_to_ids in tokenization_utils.py (the slow tokenizer) instead of tokenization_utils_fast.py?
Thanks!
cc @ArthurZucker
Same problem, is there any progress?
Hey! The main issue is that the tokenizer files at "decapoda-research/llama-7b-hf" were never updated, while you are using the latest version of transformers. The tokenizer conversion was fixed and corrected, see #22402. Nothing we can do on our end...
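If you are unsure whether a given checkpoint is affected, a quick check (a sketch; the path is just a placeholder) is whether the stored special tokens resolve to real vocabulary ids on the underlying fast tokenizer:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/llama/checkpoint")  # placeholder path
for name in ("unk_token", "bos_token", "eos_token"):
    token = getattr(tok, name)
    # tok._tokenizer is the underlying `tokenizers` object of a fast tokenizer;
    # token_to_id(...) returning None for a special token is exactly what makes
    # unk_token_id recurse.
    print(name, repr(token), None if token is None else tok._tokenizer.token_to_id(token))
```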
@ArthurZucker I am facing a similar issue with openllama
```python
from transformers import AutoTokenizer

save_dir = "../open_llama_7b_preview_300bt/open_llama_7b_preview_300bt_transformers_weights/"
tokenizer = AutoTokenizer.from_pretrained(save_dir)
tokenizer.bos_token_id
```
Calling tokenizer.bos_token_id causes a max recursion depth error.
```
>>> tokenizer
LlamaTokenizerFast(name_or_path='../open_llama_7b_preview_300bt/open_llama_7b_preview_300bt_transformers_weights/', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True)}, clean_up_tokenization_spaces=False)
```
transformers version = 4.29.1
tokenizer_config.json:

```json
{
  "bos_token": "",
  "eos_token": "",
  "model_max_length": 1e+30,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": ""
}
```
Initializing as follows works, but I am not sure if this should be needed:
```python
tokenizer = AutoTokenizer.from_pretrained(
    save_dir, unk_token="<unk>", bos_token="<s>", eos_token="</s>"
)
```
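As a follow-up, saving the re-initialized tokenizer back should write the explicit special tokens into the checkpoint, so the extra keyword arguments are not needed on every load. A sketch of what I mean (not verified):

```python
from transformers import AutoTokenizer

save_dir = "../open_llama_7b_preview_300bt/open_llama_7b_preview_300bt_transformers_weights/"

# Re-load with explicit special tokens, then persist them so the corrected values
# end up in tokenizer_config.json / special_tokens_map.json.
tokenizer = AutoTokenizer.from_pretrained(
    save_dir, unk_token="<unk>", bos_token="<s>", eos_token="</s>"
)
tokenizer.save_pretrained(save_dir)
```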
So.... Again, if you are not using the latest / most recently converted tokenizer, I cannot help you. Check out huggyllama/llama-7b, which has a working tokenizer.
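For example, something like this should resolve the special tokens without any recursion (quick, untested sketch assuming a recent transformers release):

```python
from transformers import AutoTokenizer

# huggyllama/llama-7b ships a correctly converted tokenizer, so the special
# tokens map to real vocabulary ids and the properties below do not recurse.
tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
print(tok.unk_token, tok.bos_token, tok.eos_token)
print(tok.unk_token_id, tok.bos_token_id, tok.eos_token_id)
```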