Example code results in input_ids of varying lengths

Open plamb-viso opened this issue 2 years ago • 0 comments

I followed https://github.com/microsoft/i-Code/issues/17#issuecomment-1416369657 to load the UdopTokenizer. I then followed the code examples for tokenizing text provided in rvlcdip.py.

This amounts to calling tokenizer.tokenize(text) on each word, appending the resulting sub_tokens to a text_list, and then calling tokenizer.convert_tokens_to_ids on that text_list to get input_ids. However, the resulting input_ids are never exactly 512 long; they are always either longer or shorter. This is despite the fact that tokenizer_config.json sets "model_max_length": 512.
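For reference, the flow described above can be sketched roughly as follows. This is a minimal reproduction, not the code from rvlcdip.py: ToyTokenizer is a hypothetical stand-in for UdopTokenizer that only mimics the two methods named above, and encode_words is a name made up for illustration. In standard Hugging Face tokenizers, model_max_length is only enforced when the tokenizer is called directly with truncation or padding enabled, not by tokenize() or convert_tokens_to_ids(); I am assuming UdopTokenizer behaves the same way.

```python
from typing import List


class ToyTokenizer:
    """Hypothetical stand-in for UdopTokenizer (not the real class).

    It splits each word into 3-character pieces so the example runs
    without downloading any model files.
    """

    model_max_length = 512  # present in the config, but unused by the methods below

    def tokenize(self, text: str) -> List[str]:
        return [text[i:i + 3] for i in range(0, len(text), 3)]

    def convert_tokens_to_ids(self, tokens: List[str]) -> List[int]:
        # Arbitrary deterministic mapping; real ids come from the vocabulary.
        return [sum(map(ord, tok)) % 1000 for tok in tokens]


def encode_words(tokenizer, words: List[str]) -> List[int]:
    """Tokenize word by word, collect sub-tokens, then map them to ids."""
    text_list: List[str] = []
    for word in words:
        sub_tokens = tokenizer.tokenize(word)
        text_list.extend(sub_tokens)
    return tokenizer.convert_tokens_to_ids(text_list)


tokenizer = ToyTokenizer()
short_ids = encode_words(tokenizer, ["invoice", "total"])
long_ids = encode_words(tokenizer, ["document"] * 400)

# Neither result is tied to model_max_length: the length simply follows the input.
print(len(short_ids), len(long_ids))
```

Nothing in this loop looks at model_max_length, so the output length depends only on how many sub-tokens the input words produce.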

Is the provided example code the expected way to encode text?

(It makes sense that the provided code doesn't pad or truncate on its own, but it's odd to me that rvlcdip can fine-tune correctly without a step in this tokenization piece that ensures the text_list is 512 tokens long.)

EDIT: I just noticed this pad_tokens function, but it doesn't appear to be used anywhere. Is it applied automatically once RvlCdipDataset() is created? Also, it doesn't appear to do any truncation.
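To make the question concrete, this is the kind of step I expected to find somewhere. It is a hypothetical helper, not the repo's pad_tokens: the name pad_or_truncate, the pad_id default of 0, and the returned attention mask are all my own assumptions.

```python
from typing import List, Tuple


def pad_or_truncate(
    input_ids: List[int],
    max_length: int = 512,  # mirrors "model_max_length" in tokenizer_config.json
    pad_id: int = 0,        # assumed pad token id; check the tokenizer's actual value
) -> Tuple[List[int], List[int]]:
    """Return (ids, attention_mask), both exactly max_length long."""
    clipped = input_ids[:max_length]             # truncate sequences that are too long
    n_pad = max_length - len(clipped)            # 0 when the input was long enough
    ids = clipped + [pad_id] * n_pad             # right-pad sequences that are too short
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return ids, attention_mask


ids, mask = pad_or_truncate([11, 22, 33], max_length=5)
print(ids, mask)
```

A real pipeline would presumably also have to truncate and pad the per-token bounding boxes in the same way, which this sketch leaves out.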

Feb 27 '23 20:02 plamb-viso