FastChat Language distribution of ShareGPT 70K conversation dataset for FastChat T5

What are all the languages present in the ShareGPT 70,000 conversation dataset which was used to fine-tune FastChat-T5?

The ReadMe file points to data_cleaning.md which was used to get data from ShareGPT. Within data_cleaning.md seems like sharegpt_clean_lang.json contains the list of languages in consideration and some languages are skipped.

Jun 05 '23 08:06 Mihir2

how can i finetune with bounds of datasets?

Jul 20 '23 03:07 kkkparty

What are all the languages present in the ShareGPT 70,000 conversation dataset which was used to fine-tune FastChat-T5?

The ReadMe file points to data_cleaning.md which was used to get data from ShareGPT. Within data_cleaning.md seems like sharegpt_clean_lang.json contains the list of languages in consideration and some languages are skipped.

Hi I have the same question about the language distribution, do you have any idea?

Apr 08 '24 12:04 Z1zs