latent-diffusion
Why use only the pre-trained BERT tokenizer and not the entire pre-trained BERT model (including the pre-trained encoder)?
I am not sure why the implementation only uses the tokenizer from Hugging Face but not the pre-trained encoder. Why does the BERT-like transformer need to be retrained at all? Are the text embeddings from the original BERT model not good enough? And why not fine-tune it instead of training from scratch?
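For concreteness, here is a minimal sketch (using the Hugging Face `transformers` API) of what I mean by using the full pre-trained BERT encoder as the text conditioner, rather than only tokenizing and learning a transformer from scratch. The model name and the 77-token context length are just illustrative assumptions, not values taken from the repo:

```python
# Hypothetical alternative: feed prompts through a frozen (or fine-tuned)
# pre-trained BERT encoder and use its hidden states as conditioning.
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
encoder.eval()  # freeze; alternatively keep it trainable to fine-tune

prompt = "a painting of a virus monster playing guitar"
tokens = tokenizer(prompt, truncation=True, max_length=77,
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    # (1, 77, 768) contextual embeddings that could in principle be passed
    # to the diffusion model's cross-attention layers as the text context.
    context = encoder(**tokens).last_hidden_state
```

Is there a reason this kind of setup was not used, e.g. the pre-trained embeddings being poorly suited for conditioning image generation?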