custom datasets
How can I customize my datasets to train the PureT model?
You need to construct JSON files for your own dataset following the MSCOCO dataset format, then generate the necessary files for training. I have uploaded a new notebook, "ICC分词预处理.ipynb", for reference; it performs the pre-processing (generating the necessary files) for an Image Chinese Captioning dataset.
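For orientation, here is a minimal sketch of what a custom annotation file in the MSCOCO/Karpathy style can look like. The field names (`filepath`, `filename`, `split`, `sentences`, `tokens`, etc.) follow the commonly used "dataset_coco.json" layout; the dataset name and example entry are hypothetical, so adjust them to your own data:

```python
import json

# Minimal sketch of a custom annotation file in the Karpathy-split style.
# Field names mirror the common "dataset_coco.json" layout; the content
# below is a made-up example entry.
dataset = {
    "dataset": "my_custom_dataset",  # hypothetical dataset name
    "images": [
        {
            "filepath": "train_images",  # sub-folder containing the image
            "filename": "000001.jpg",
            "imgid": 0,
            "split": "train",            # "train" / "val" / "test"
            "sentids": [0],
            "sentences": [
                {
                    "raw": "A dog runs across the grass.",
                    "tokens": ["a", "dog", "runs", "across", "the", "grass"],
                    "imgid": 0,
                    "sentid": 0,
                }
            ],
        }
    ],
}

# Write it out in the same shape the pre-processing notebook expects to read.
with open("dataset_custom.json", "w") as f:
    json.dump(dataset, f)
```

With a file in this shape, the notebook's tokenization and file-generation steps can be pointed at your dataset instead of MSCOCO.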
Thank you! I will try it soon.
I'm sorry, I still have some questions. In your "ICC分词输出.ipynb" I can't find anything about "coco_train_input.pkl". Do you have any tools to transform the COCO Captions dataset (English, not Chinese)? I mean, how do I generate all the files under the "mscoco" folder, such as "txt", "misc", and "sent"?
The core generation logic (how to generate all the necessary files under the mscoco folder) is located below the code cells shown in the snapshot image. I have not kept the pre-processing code for the COCO dataset. Actually, you only need to replace the "ICC_" prefix of all files with "coco_" (for example, replace `sent_input_file = './ICC_train_input.pkl'` with `sent_input_file = './coco_train_input.pkl'`) and replace `raw_train_annotation_file` and `raw_val_annotation_file` with the MSCOCO annotation JSON files. The generation logic is the same on the whole.
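The prefix swap described above can also be done on already-generated files. Here is a small, hedged helper (not part of the PureT repo) that renames every `ICC_*` file in a directory to `coco_*`:

```python
from pathlib import Path

def rename_prefix(directory, old="ICC_", new="coco_"):
    """Rename every file in `directory` whose name starts with `old`
    so that it starts with `new` instead. Returns the new names."""
    renamed = []
    for path in Path(directory).glob(old + "*"):
        target = path.with_name(new + path.name[len(old):])
        path.rename(target)
        renamed.append(target.name)
    return sorted(renamed)
```

For example, `rename_prefix("./mscoco/sent")` would turn `ICC_train_input.pkl` into `coco_train_input.pkl`. Editing the variable assignments inside the notebook, as described above, achieves the same thing at generation time.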
Or you can refer to the reference GitHub projects listed in the README for more information.
I'm sorry for my oversight. There is a "dataset_coco.json" file in the "mscoco" directory, and I would like to know how this file is generated.
I haven't started running the following code yet.

The "dataset_coco. json" file is the Karpathy split annotation file of MSCOCO Captioning, it is just the re-organization of MSCOCO raw JSON annotation. Maybe you need to refer to https://github.com/karpathy/neuraltalk for more details.
Could you please upload an English version of the "ICC分词预处理.ipynb" file?
How do I preprocess an English image captioning dataset?