How do I tokenize my data to prepare for finetuning 'urchade/gliner_multi-v2.1'

Open AnjaliSetiya opened this issue 10 months ago • 1 comments

@urchade Hello everyone, I want to fine-tune the multi-v2.1 version on my data. The example finetune.ipynb reads a data.json file that is already tokenized. How can I tokenize my own data so that I can use it for fine-tuning? Is there an example file for that? For context, my data is a mixed bag of IDs, alphanumeric items, customer names, punctuation marks, etc.

Appreciate any help.

Thanks

Mar 25 '25 06:03 AnjaliSetiya

If you have the raw texts, you can split them into tokens with WordsSplitter (from gliner.data_processing.tokenizer import WordsSplitter).
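As a rough sketch of what that preparation could look like, the snippet below turns raw text plus character-level annotations into token-level records. It makes several assumptions that should be checked against the installed GLiNER version: that data.json entries are dicts with "tokenized_text" and "ner" keys, that "ner" holds [start_token, end_token, label] with an inclusive end index, and that a whitespace-style split is adequate. To keep it self-contained, split_words is a hypothetical stdlib stand-in for WordsSplitter rather than the library call itself, and to_training_example is likewise an illustrative helper, not part of GLiNER.

```python
import re

# Stand-in for the library's word splitter (assumption): a word, optionally
# joined to further words by '-' or '_', or any single non-space character.
TOKEN_RE = re.compile(r"\w+(?:[-_]\w+)*|\S")


def split_words(text):
    """Return (token, char_start, char_end) triples, end exclusive."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def to_training_example(text, char_spans):
    """Convert character-level spans into token-level spans.

    char_spans: list of (char_start, char_end, label), end exclusive.
    Returns a dict in the layout assumed here for data.json:
    'tokenized_text' is the token list, 'ner' holds
    [first_token, last_token, label] with an inclusive last index.
    """
    tokens = split_words(text)
    ner = []
    for c_start, c_end, label in char_spans:
        # Indices of tokens that lie fully inside the annotated range.
        covered = [
            i for i, (_, s, e) in enumerate(tokens)
            if s >= c_start and e <= c_end
        ]
        if covered:  # skip annotations that match no whole token
            ner.append([covered[0], covered[-1], label])
    return {"tokenized_text": [t for t, _, _ in tokens], "ner": ner}


# Made-up sample resembling the data described in the question.
text = "Order AB-1234 was placed by Jane Doe."
spans = [(6, 13, "order id"), (28, 36, "customer name")]
example = to_training_example(text, spans)
print(example)
```

With this splitting rule, an ID such as AB-1234 stays a single token while the trailing period becomes its own token. If your IDs contain other separators (slashes, dots), it is worth inspecting the tokens before training, since an annotation that cuts through a token is silently dropped by this sketch.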

Jun 06 '25 10:06 marcg03