datacomp
DataComp: In search of the next generation of multimodal datasets
Hi all, thank you for the great package. I'm currently running into issues downloading [CommonPool X-Large with PySpark](https://github.com/rom1504/img2dataset/blob/main/examples/distributed_img2dataset_tutorial.md). I've downloaded the metadata files and uploaded them to S3; however, my...
We're trying to produce the filtered subset of your `large` pool with the method "Intersection of image-based and CLIP score filtering." I know there is a script provided in the...
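As a rough illustration of what intersecting two filters means here, the sketch below assumes each filter has already produced a collection of sample uids. The function name, variable names, and uid values are hypothetical, and this is not the script shipped with the repository.

```python
def intersect_filters(image_based_uids, clip_score_uids):
    """Return the uids kept by BOTH filters, in sorted order.

    Assumption: each argument is an iterable of hashable uids
    produced independently by one of the two filters.
    """
    return sorted(set(image_based_uids) & set(clip_score_uids))


# Hypothetical uids, for illustration only.
image_based = ["uid_a", "uid_b", "uid_c"]
clip_score = ["uid_b", "uid_c", "uid_d"]
subset = intersect_filters(image_based, clip_score)
print(subset)
```

A sample survives only if it passes both filters, so the resulting subset is never larger than the smaller of the two inputs.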
Hello! I'm running the command:

`python download_upstream.py --scale medium --data_dir medium --skip_shards`

After downloading some files, it is interrupted with the error:

```
  File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 94, in thread_map
    return _executor_map(ThreadPoolExecutor,...
```
Hi all, thanks for your great work!! I have a few related technical questions. I used dataset2metadata to generate the metadata from the 'shards' and store some...
A recent report definitively found CSAM in LAION-5B, and that dataset has been taken down until the problem can be solved. The DataComp dataset is much larger. Please let us...
Thank you for your excellent work. I'm currently training my own CLIP model and have a question. If I train on the LAION-2B, COYO-700M, and DataComp datasets simultaneously, will it...
* Needed for https://github.com/mlfoundations/datacomp/issues/59 * Closes https://github.com/mlfoundations/datacomp/issues/66 with https://github.com/mlfoundations/datacomp/pull/58#discussion_r1343359420
* Inspired by https://github.com/rom1504/img2dataset/pull/272
* Depends on https://github.com/mlfoundations/datacomp/pull/58
* Depends on https://github.com/mlfoundations/datacomp/pull/60

# Usage

## Cluster creation

```bash
ray up --yes cluster.yml
```

```bash
ray dashboard cluster.yml
```

## Job...
# Introduction

We downloaded the [DataComp 1B set](https://huggingface.co/datasets/mlfoundations/datacomp_1b). For verification, we kept an image only if the SHA256 checksum of its bytes matched the corresponding entry in the metadata...
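The checksum test described above can be sketched as follows. This is a minimal illustration, assuming the metadata provides a hex-encoded SHA256 digest for each image; the function names and sample bytes are hypothetical and not taken from the report.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex-encoded SHA256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def keep_image(image_bytes: bytes, expected_sha256: str) -> bool:
    """Keep an image only if its digest matches the metadata entry.

    Assumption: `expected_sha256` is a hex string; surrounding whitespace
    and letter case are normalized before comparing.
    """
    return sha256_hex(image_bytes) == expected_sha256.strip().lower()


# Hypothetical downloaded bytes and metadata entry, for illustration only.
downloaded = b"example image bytes"
metadata_digest = sha256_hex(downloaded)

print(keep_image(downloaded, metadata_digest))            # digest matches
print(keep_image(downloaded + b"\x00", metadata_digest))  # bytes were altered
```

Hashing the raw downloaded bytes, rather than a decoded or re-encoded image, matters here: any resizing or re-compression during download changes the bytes and therefore the digest.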