
Florence-2: How to add custom tokens during fine-tuning training?

Open lixumin-zai opened this issue 1 year ago • 1 comments

This is how the official code adds its special tokens:

# processing_florence2.py
tokens_to_add = {
    'additional_special_tokens':
        tokenizer.additional_special_tokens
        + ['<od>', '</od>', '<ocr>', '</ocr>']
        + [f'<loc_{x}>' for x in range(1000)]
        + [
            '<cap>', '</cap>', '<ncap>', '</ncap>', '<dcap>', '</dcap>',
            '<grounding>', '</grounding>', '<seg>', '</seg>', '<sep>',
            '<region_cap>', '</region_cap>',
            '<region_to_desciption>', '</region_to_desciption>',
            '<proposal>', '</proposal>', '<poly>', '</poly>', '<and>',
        ]
}

For training, I added my custom tokens the same way and then resized the model's embeddings:

model.resize_token_embeddings(len(processor.tokenizer))

The model's output looks very good at the beginning, but the later part is gibberish.

lixumin-zai avatar Jul 29 '24 08:07 lixumin-zai

You could try a small modification to the get_model_tokenizer_florence function at https://github.com/modelscope/swift/blob/main/swift/llm/utils/model.py#L2738:

new_tokens = ["YOUR CUSTOM TOKENS"]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
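
To illustrate why the two calls go together, here is a toy sketch in plain Python. It does not use transformers or swift. The helpers add_tokens and resize_embeddings are hypothetical stand-ins written for this example, and starting the new rows at the mean of the existing rows is an assumption of the sketch, not something stated in this thread.

```python
import random


def add_tokens(vocab, new_tokens):
    """Toy stand-in for tokenizer.add_tokens (hypothetical helper).

    Appends tokens that are not yet in the vocabulary and returns
    how many were actually added.
    """
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # next free id
            added += 1
    return added


def resize_embeddings(emb, new_size):
    """Toy stand-in for model.resize_token_embeddings (hypothetical helper).

    Grows the table to new_size rows; shrinking is not handled here.
    New rows start at the mean of the existing rows, which is an
    assumption of this sketch.
    """
    dim = len(emb[0])
    mean_row = [sum(row[j] for row in emb) / len(emb) for j in range(dim)]
    extra = [list(mean_row) for _ in range(new_size - len(emb))]
    return emb + extra


random.seed(0)
vocab = {"<s>": 0, "</s>": 1, "<od>": 2}
emb = [[random.random() for _ in range(4)] for _ in vocab]
old_rows = [list(row) for row in emb]

# "<od>" already exists, so only the two table tokens are added.
num_added = add_tokens(vocab, ["<od>", "<table>", "</table>"])

# Without this step, ids 3 and 4 would have no embedding row.
emb = resize_embeddings(emb, len(vocab))
```

The point of the sketch is that every new token id needs a row in the embedding table, and those rows carry no learned meaning until fine-tuning updates them.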

hjh0119 avatar Jul 29 '24 14:07 hjh0119

The file at https://github.com/modelscope/swift/blob/main/swift/llm/utils/model.py#L2738 no longer exists in swift 3.1. Where should special tokens be added now?

joey9503 avatar Feb 17 '25 09:02 joey9503