ms-swift
Florence-2: How to add custom tokens during fine-tuning?
This is how the official code adds its tokens:
# processing_florence2.py
tokens_to_add = {'additional_special_tokens': \
tokenizer.additional_special_tokens + \
['<od>', '</od>', '<ocr>', '</ocr>'] + \
[f'<loc_{x}>' for x in range(1000)] + \
['<cap>', '</cap>', '<ncap>', '</ncap>','<dcap>', '</dcap>', '<grounding>', '</grounding>', '<seg>', '</seg>', '<sep>', '<region_cap>', '</region_cap>', '<region_to_desciption>', '</region_to_desciption>', '<proposal>', '</proposal>', '<poly>', '</poly>', '<and>']}
I added my custom tokens the same way for training and resized the model's embeddings to match:
model.resize_token_embeddings(len(processor.tokenizer))
The model's output looks good at the beginning, but the latter part turns into gibberish.
You could try a small modification in the get_model_tokenizer_florence function at https://github.com/modelscope/swift/blob/main/swift/llm/utils/model.py#L2738:
new_tokens = ["YOUR CUSTOM TOKENS"]  # replace with your own tokens
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
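The suggestion above can be wrapped in a small helper. This is only a minimal sketch, not part of ms-swift: the name add_tokens_and_resize is hypothetical, and it assumes a Hugging Face style tokenizer (get_vocab, add_tokens, len) and a model that exposes resize_token_embeddings. It skips tokens that are already in the vocabulary (Florence-2 already ships tokens such as '<od>' and '<loc_0>') and resizes only when something was actually added.

```python
def add_tokens_and_resize(model, tokenizer, new_tokens):
    """Add tokens missing from the vocabulary, then resize the embeddings.

    Hypothetical helper; `model` and `tokenizer` are assumed to follow the
    Hugging Face interfaces used in the snippet above.
    Returns the number of tokens that were actually added.
    """
    vocab = tokenizer.get_vocab()
    # Drop duplicates (keeping order) and tokens the tokenizer already knows.
    missing = [t for t in dict.fromkeys(new_tokens) if t not in vocab]
    if not missing:
        return 0
    num_added = tokenizer.add_tokens(missing)
    if num_added > 0:
        # Keep the embedding matrix in sync with the enlarged vocabulary.
        model.resize_token_embeddings(len(tokenizer))
    return num_added
```

With Florence-2 this would presumably be called as add_tokens_and_resize(model, processor.tokenizer, ["YOUR CUSTOM TOKENS"]) before training starts. The newly added embedding rows are untrained, so the new tokens carry no meaning until fine-tuning updates them.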
The file at https://github.com/modelscope/swift/blob/main/swift/llm/utils/model.py#L2738 cannot be found in swift 3.1. Where should the special tokens be added now?