Could you please provide your code for downloading CC3M+CC12M+SBU data from the json file you provided?
The JSON files contain the image URLs and text. You can write a script to download the images from those URLs. This tool could be helpful: https://github.com/rom1504/img2dataset.
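In case it helps anyone else, here is a minimal sketch of such a download script using only the Python standard library. It is not the authors' code. The field name `url`, the `.jpg` suffix, and the two-character subfolder layout are all assumptions of mine, so adjust them to match the actual JSON files.

```python
import hashlib
import json
import os
import urllib.request


def local_path(url, out_dir):
    """Map a URL to a deterministic file path, sharded into subfolders."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Assumption: the first two hex characters are used as a subfolder so
    # that no single directory has to hold every image. The ".jpg" suffix
    # is also assumed; the real format depends on the remote file.
    return os.path.join(out_dir, digest[:2], digest + ".jpg")


def fetch_bytes(url, timeout=10):
    """Download the raw bytes behind a URL."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(entries, out_dir, fetch=fetch_bytes, url_key="url"):
    """Download each entry's image; return (url, path) pairs that succeeded."""
    done = []
    for entry in entries:
        url = entry[url_key]  # "url" is an assumed field name
        path = local_path(url, out_dir)
        if not os.path.exists(path):  # skip files saved by an earlier run
            try:
                data = fetch(url)
            except Exception:
                continue  # skip links that fail to download
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                f.write(data)
        done.append((url, path))
    return done


def download_from_json(json_path, out_dir):
    """Load a JSON list of entries and download all of their images."""
    with open(json_path) as f:
        entries = json.load(f)
    return download_all(entries, out_dir)
```

Calling `download_from_json` with the path to one of the JSON files and an output folder returns the URL-to-path pairs for the images that were saved, which you could then use to point the annotations at local files. For millions of URLs this sequential loop will be slow, which is where img2dataset is the better choice.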
I see that the code reads each image file directly from its path rather than from .tar or parquet files. However, img2dataset says that "handling more than a million files in standard filesystem does not work well." It therefore suggests using the webdataset format. Do I have to untar the files to support your data-reading strategy?
Yes, you need raw image paths to work with our annotation files. Alternatively, you can create new tars and use webdataset.
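For the untar route, a rough sketch with the standard `tarfile` module could look like the following. It assumes each shard stores images next to caption and metadata files that share a basename, and that only the images are needed on disk. The extension list and the flat output folder are my assumptions, not something this repo prescribes.

```python
import os
import tarfile

# Assumed set of image extensions to keep; extend as needed.
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".webp")


def extract_images(tar_path, out_dir):
    """Extract only the image members of one tar shard into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Keep only the basename so members cannot write outside out_dir.
            name = os.path.basename(member.name)
            if not name.lower().endswith(IMAGE_EXTS):
                continue  # skip non-image members such as captions
            target = os.path.join(out_dir, name)
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                dst.write(src.read())
            paths.append(target)
    return paths
```

Running `extract_images` once per shard, with a separate output folder for each shard, keeps the number of files per directory small. The image paths in the annotation files would then have to match wherever the files end up.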
Got it! Thanks for your reply :)