downsampled_imagenet broken
Hi TFDS,
downsampled_imagenet (32x32) gives a 404 (stack trace at end of issue). This is because the ImageNet link stored by tfds (https://image-net.org/small/download.php) is broken. The same broken link is also cited in some papers, such as Pixel Recurrent Neural Networks.
There is a different, currently working link for 32x32 ImageNet (https://image-net.org/download-images.php; if you log in, you can see a 32x32 option).
Let us refer to them as OLD (what TFDS used to host) and NEW (currently on imagenet website).
An anonymous ICLR reviewer (see "weaknesses" under reviewer AKwV) mentioned that NEW is "too easy" and that results on it cannot be compared with older results obtained on OLD. The reviewer also mentioned that OLD circulates in the community via a torrent.
TFDS' link to OLD likely broke within the last 9 months, since another Google repo shared code that uses tfds to get downsampled_imagenet (I left an issue there: https://github.com/google-research/vdm/issues/8) and their datasets.py file was pushed 9 months ago.
Neither OLD nor NEW is the same as imagenet_resized.
Purpose:
- for the tfds team to decide what to do about the broken link, in light of the above considerations. Fixing it helps the library regardless of any research community issues.
- (possibly beyond tfds) clarify the difference for researchers and make both versions available
Possible solution:
- if several people can confirm that they hold the same copy of OLD, it could be posted on tfds as an "old_downsampled_imagenet" to help reproduce existing research that used the data.
Examples of research using OLD
Some ICLR publications from this year already use NEW.
Thanks! Mark
Environment information
- Operating System: Ubuntu VERSION="18.04.6 LTS (Bionic Beaver)"
- Python version: 3.9.12
- tensorflow-datasets/tfds-nightly version: tfds '4.7.0' and tfds '4.8.2+nightly'
- tensorflow/tf-nightly version: tf '2.10.0'
- Does the issue still exist with the last tfds-nightly package (pip install --upgrade tfds-nightly)? Yes
Reproduction instructions
import tensorflow_datasets as tfds
ds = tfds.load('downsampled_imagenet', split='validation', as_supervised=True, batch_size=128)
Link to logs
2023-01-18 12:03:50.178320: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-18 12:03:51.793197: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /home/marik/tensorflow_datasets/downsampled_imagenet/32x32/2.0.0...
Dl Size...: 0 MiB [00:00, ? MiB/s] | 0/2 [00:00<?, ? url/s]
Dl Completed...: 0%| | 0/2 [00:00<?, ? url/s]
Traceback (most recent call last):
File "/home/marik/imnet2.py", line 2, in <module>
ds = tfds.load('downsampled_imagenet', split='validation', as_supervised=True, batch_size=128)
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/logging/__init__.py", line 250, in decorator
return function(*args, **kwargs)
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/load.py", line 575, in load
dbuilder.download_and_prepare(**download_and_prepare_kwargs)
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/dataset_builder.py", line 523, in download_and_prepare
self._download_and_prepare(
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1244, in _download_and_prepare
split_generators = self._split_generators( # pylint: disable=unexpected-keyword-arg
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/image/downsampled_imagenet.py", line 102, in _split_generators
train_path, valid_path = dl_manager.download([
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 552, in download
return _map_promise(self._download, url_or_urls)
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 770, in _map_promise
res = tf.nest.map_structure(lambda p: p.get(), all_promises) # Wait promises
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow/python/util/nest.py", line 917, in map_structure
structure[0], [func(*x) for x in entries],
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow/python/util/nest.py", line 917, in <listcomp>
structure[0], [func(*x) for x in entries],
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 770, in <lambda>
res = tf.nest.map_structure(lambda p: p.get(), all_promises) # Wait promises
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/promise/promise.py", line 512, in get
return self._target_settled_value(_raise=True)
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/promise/promise.py", line 516, in _target_settled_value
return self._target()._settled_value(_raise)
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/promise/promise.py", line 226, in _settled_value
reraise(type(raise_val), raise_val, self._traceback)
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/six.py", line 719, in reraise
raise value
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/promise/promise.py", line 844, in handle_future_result
resolve(future.result())
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/downloader.py", line 217, in _sync_download
with _open_url(url, verify=verify) as (response, iter_content):
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/contextlib.py", line 119, in __enter__
return next(self.gen)
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/downloader.py", line 279, in _open_with_requests
_assert_status(response)
File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/downloader.py", line 306, in _assert_status
raise DownloadError('Failed to get url {}. HTTP code: {}.'.format(
tensorflow_datasets.core.download.downloader.DownloadError: Failed to get url https://image-net.org/small/train_32x32.tar. HTTP code: 404.
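For anyone who wants to confirm the 404 without going through the full tfds download path, here is a minimal sketch using only the standard library. The URL is the one from the error message above; the helper name `http_status` and its `opener` parameter are made up for illustration and are not part of tfds. It assumes the server answers HEAD requests, which may not hold everywhere.

```python
import urllib.error
import urllib.request

# URL copied from the DownloadError message in the trace above.
TRAIN_URL = "https://image-net.org/small/train_32x32.tar"


def http_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code of a HEAD request to `url`.

    `opener` is a hypothetical hook (defaulting to urllib's urlopen)
    so the function can be exercised without network access.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is on the exception.
        return err.code


# Example usage (performs a real network request):
# print(http_status(TRAIN_URL))
```

If this returns 404 for the URL above, the failure is on the hosting side rather than in the local tfds installation.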
I also reached out to the imagenet moderators to hear their input and will post any response here.
@Kim-Dongjun provided a good explanation and shared the location of the torrent that people use for the original data from the Pixel RNN paper. Here is Dongjun's explanation of the discrepancy (which also matches what I've heard from some authors at talks and conferences):
- There is a downsampled ImageNet dataset, which I call "small".
- The small dataset was widely used in the generative-modeling community for a long time,
- but StyleGAN-XL, Efficient-VDVAE, or other large-scale papers tend to use the ILSVRC12 dataset when reporting results on ImageNet 32x32 or ImageNet 64x64.
- The small dataset, however, is not officially available. It is available at this torrent link.
- It is strange that we have to use a torrent for research, but as far as I know, there is no other website from which we can download the downsampled "small" ImageNet dataset.
- The telltale sign is that the downsampled dataset has 49999 validation images, whereas the original ILSVRC12 dataset has 50000.
- The downsampled dataset is from the Pixel RNN paper.
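The validation-count difference described above (49999 vs. 50000) gives a quick way to guess which variant a local copy is. The following is only a sketch: the function names are made up for illustration, and it assumes the validation split is a tar archive of individual image files with common image extensions, which may not match every copy in circulation.

```python
import tarfile

# Validation counts as stated in the explanation above.
OLD_VALIDATION_COUNT = 49999  # "small" variant from the Pixel RNN paper
NEW_VALIDATION_COUNT = 50000  # ILSVRC12-based variant

# Assumption: images are stored as individual files with these suffixes.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg")


def count_images_in_tar(tar_path):
    """Count regular files in the archive whose names look like images."""
    with tarfile.open(tar_path) as archive:
        return sum(
            1
            for member in archive
            if member.isfile() and member.name.lower().endswith(IMAGE_SUFFIXES)
        )


def guess_variant(num_validation_images):
    """Map a validation-set size to "old", "new", or "unknown"."""
    if num_validation_images == OLD_VALIDATION_COUNT:
        return "old"
    if num_validation_images == NEW_VALIDATION_COUNT:
        return "new"
    return "unknown"
```

For a local validation archive, `guess_variant(count_images_in_tar(path))` would then return one of the three labels. A count match is only a heuristic and does not prove the contents are identical to either variant.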
Here is a summary.
For imagenet 32x32, some papers use an "old" version and some use a "new" version. My understanding is:
- the "new" one is the one currently available here
- the "old" one was previously available here
- the "old" one is still unofficially available at this torrent link. I downloaded this and can share it more directly in case the torrent is too slow.
- TFDS has two 32x32 ImageNet datasets: "downsampled_imagenet" and "imagenet_resized"
- tfds "resized" is a different dataset unrelated to this discussion, and the tfds docs already do a good job of warning that the dataset differs
- tfds "downsampled" currently gives a 404 error because it goes to the "old" link
- unfortunately, papers do not always make clear whether they used "old", "new", or "resized", and this affects reported likelihoods and the ability to reproduce research
My proposals are:
- tfds chooses a default to fix the 404 (maybe "new", since it is officially available)
- consider whether it also makes sense to host the "old" one to help reproduce old research. If so, which would be the ground-truth source: the torrent, or the original Pixel RNN authors?
Thanks. I'm curious to hear others' take on this issue, and would appreciate it if others could confirm the summary above.