Example code results in `input_ids` of varying lengths
I followed https://github.com/microsoft/i-Code/issues/17#issuecomment-1416369657 to load the `UdopTokenizer`, and then followed the text-tokenization example code in `rvlcdip.py`.
This amounts to calling `tokenizer.tokenize(text)` on each word, appending the resulting `sub_tokens` to a `text_list`, and then calling `tokenizer.convert_tokens_to_ids` on that `text_list` to get `input_ids`. However, the resulting `input_ids` vary in length and are rarely exactly 512 tokens, even though `tokenizer_config.json` sets `"model_max_length": 512`.
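To make the problem concrete, here is a toy reproduction of that loop. The tokenizer below is a fake two-character splitter standing in for `UdopTokenizer` (its vocabulary and id mapping are invented); only the shape of the loop matches the example code:

```python
# Toy reproduction of the rvlcdip.py tokenization pattern. fake_tokenize and
# the hash-based id mapping are stand-ins, not UDOP's real tokenizer.

def fake_tokenize(word):
    # Stand-in for tokenizer.tokenize(text): splits a word into 2-char pieces.
    return [word[i:i + 2] for i in range(0, len(word), 2)]

def encode(words):
    text_list = []
    for word in words:
        sub_tokens = fake_tokenize(word)  # per-word subword tokenization
        text_list.extend(sub_tokens)      # appended to a flat text_list
    # stand-in for tokenizer.convert_tokens_to_ids(text_list)
    return [hash(tok) % 1000 for tok in text_list]

# Nothing in the loop pads or truncates, so output length tracks the input:
assert len(encode(["short", "doc"])) != len(encode(["a", "much", "longer", "document"]))
```

So the length of `input_ids` is entirely determined by the document; `model_max_length` in the config never comes into play on this code path.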
Is this provided example code the expected way to encode text?
(It makes sense that the provided code doesn't pad or truncate, but it's odd to me that rvlcdip can fine-tune correctly without a step in this tokenization piece that ensures `text_list` is exactly 512 tokens long.)
EDIT: I just noticed this `pad_tokens` function, but it doesn't appear to be used anywhere. Is it called automatically once `RvlCdipDataset()` is created? Also, it doesn't appear to do any truncation.
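For what it's worth, here is a guess at where such a step would have to run for fine-tuning to see fixed-length batches. Everything here is illustrative, not i-Code's actual code: `pad_tokens` is my own pad-and-truncate helper (the repo's version seems to pad only), and `SimpleDataset` merely sketches a `__getitem__` hook like `RvlCdipDataset` might have:

```python
# Hypothetical sketch: per-example padding/truncation inside a dataset's
# __getitem__, so every item comes out exactly MAX_LEN long. Names are
# illustrative, not taken from the i-Code repo.

MAX_LEN = 512  # matches "model_max_length" in tokenizer_config.json
PAD_ID = 0     # hypothetical pad token id

def pad_tokens(ids, max_len=MAX_LEN, pad_id=PAD_ID):
    # Padding alone handles short sequences; without the slice below,
    # long documents would still overflow max_len.
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

class SimpleDataset:
    def __init__(self, examples):
        self.examples = examples  # each example: a variable-length id list

    def __getitem__(self, idx):
        # Normalizing here means a default collate function can stack
        # items into a rectangular batch tensor.
        return pad_tokens(self.examples[idx])

ds = SimpleDataset([list(range(10)), list(range(600))])
assert len(ds[0]) == len(ds[1]) == 512
```

If nothing like this runs between tokenization and batching, I don't see how the variable-length `input_ids` above could be collated for fine-tuning.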