custom datasets
How can I customize my datasets to train the PureT model?
You need to construct JSON files for your own dataset following the MSCOCO dataset format, then generate the necessary files for training. I have uploaded a new notebook, "ICC分词预处理.ipynb", for reference; it performs the pre-processing (generating the necessary files) for an Image Chinese Captioning dataset.
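For orientation, here is a minimal sketch of what a custom annotation file in the MSCOCO/Karpathy style can look like. The field names (`filepath`, `filename`, `split`, `sentences`, `tokens`, etc.) follow the commonly used "dataset_coco.json" layout; the dataset name and example entry are hypothetical, so adjust them to your own data:

```python
import json

# Minimal sketch of a custom annotation file in the Karpathy-split style.
# Field names mirror the common "dataset_coco.json" layout; the content
# below is a made-up example entry.
dataset = {
    "dataset": "my_custom_dataset",  # hypothetical dataset name
    "images": [
        {
            "filepath": "train_images",  # sub-folder containing the image
            "filename": "000001.jpg",
            "imgid": 0,
            "split": "train",            # "train" / "val" / "test"
            "sentids": [0],
            "sentences": [
                {
                    "raw": "A dog runs across the grass.",
                    "tokens": ["a", "dog", "runs", "across", "the", "grass"],
                    "imgid": 0,
                    "sentid": 0,
                }
            ],
        }
    ],
}

# Write it out in the same shape the pre-processing notebook expects to read.
with open("dataset_custom.json", "w") as f:
    json.dump(dataset, f)
```

With a file in this shape, the notebook's tokenization and file-generation steps can be pointed at your dataset instead of MSCOCO.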
Thank you! I will try it soon.
I'm sorry, I still have some questions. In your "ICC分词输出.ipynb" I can't find anything about "coco_train_input.pkl". Do you have any tools to transform the COCO Captions dataset (English, not Chinese)? I mean, how do I generate all the files under the "mscoco" folder, such as "txt", "misc", and "sent"?
The core generation logic (how to generate all the necessary files under the mscoco folder) is located below the code cells shown in the snapshot image. I have not kept the pre-processing code for the COCO dataset. Actually, you only need to replace the "ICC_" prefix of all files with "coco_" (for example, replace `sent_input_file = './ICC_train_input.pkl'` with `sent_input_file = './coco_train_input.pkl'`) and replace `raw_train_annotation_file` and `raw_val_annotation_file` with the MSCOCO annotation JSON files. The generation logic is the same on the whole.
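The prefix swap described above can also be done on already-generated files. Here is a small, hedged helper (not part of the PureT repo) that renames every `ICC_*` file in a directory to `coco_*`:

```python
from pathlib import Path

def rename_prefix(directory, old="ICC_", new="coco_"):
    """Rename every file in `directory` whose name starts with `old`
    so that it starts with `new` instead. Returns the new names."""
    renamed = []
    for path in Path(directory).glob(old + "*"):
        target = path.with_name(new + path.name[len(old):])
        path.rename(target)
        renamed.append(target.name)
    return sorted(renamed)
```

For example, `rename_prefix("./mscoco/sent")` would turn `ICC_train_input.pkl` into `coco_train_input.pkl`. Editing the variable assignments inside the notebook, as described above, achieves the same thing at generation time.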
Or you can refer to the reference GitHub projects listed in the README for more information.
I'm sorry for my oversight. There is a "dataset_coco.json" file in the "mscoco" directory, and I would like to know how this file is generated.
I haven't started running the following code yet.

The "dataset_coco. json" file is the Karpathy split annotation file of MSCOCO Captioning, it is just the re-organization of MSCOCO raw JSON annotation. Maybe you need to refer to https://github.com/karpathy/neuraltalk for more details.
Could you please upload an English version of the "ICC分词预处理.ipynb" file?
How do I preprocess an English image captioning dataset?