Tom
I'm not sure I understand the structure of the dataset. In general, WebDataset does not randomize the order of your shards. Of course, if you use multiple workers with DataLoader,...
PyTorch has two fundamentally different forms of datasets: indexed datasets, and iterable datasets. Both are recognized by DataLoader but are treated very differently by the PyTorch library. This is just...
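To make the distinction concrete, here is a minimal plain-Python sketch of the two access protocols. The class names are hypothetical, and the sketch deliberately avoids importing PyTorch; in real code these would subclass `torch.utils.data.Dataset` and `torch.utils.data.IterableDataset` respectively.

```python
class IndexedSquares:
    """Indexed (map-style) dataset: random access via __len__ and __getitem__."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index * index


class StreamedSquares:
    """Iterable dataset: sequential access only, via __iter__."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield i * i


indexed = IndexedSquares(5)    # a sampler may request any index, in any order
streamed = StreamedSquares(5)  # can only be consumed front to back
```

With the indexed form, the loader decides which sample to fetch next; with the iterable form, the dataset itself decides the order, which is the model WebDataset follows.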
Note that you have the option of using a small shell script or compound command, something like this: ``` s3cmd cp s3://... /tmp/$$ && cat /tmp/$$ && rm -f /tmp/$$...
> Could it be that this approach also have better memory usage? since it streams the bytes content instead of loading everything to the machine? The command-line based I/O is...
Sorry for the late response. The best way of dealing with unreliable storage or S3 connections is to write a script that retries until it correctly retrieves the file. You...
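As a rough sketch of the retry idea, the helper below calls a download function repeatedly until it succeeds. The names `retry` and `fetch` are hypothetical and not part of WebDataset; `fetch` stands in for whatever actually retrieves the file, and the choice of `OSError` as the retryable error is an assumption.

```python
import time


def retry(fetch, attempts=5, delay=1.0, backoff=2.0):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is a hypothetical zero-argument callable that returns the file
    contents and raises OSError on a failed or incomplete transfer.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:
            if attempt == attempts:
                raise  # give up and surface the last error
            time.sleep(delay)
            delay *= backoff  # wait longer before each new attempt
```

The exponential backoff is a design choice rather than a requirement; a fixed delay would also work for short outages.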
WebDataset uses `allow_pickle=False` by default for the `.npz` and `.npy` extensions. That is the safer and more common default; it's generally not a good idea to put objects...
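The effect of that setting can be seen with NumPy alone. This is only an illustration of `np.load` behavior on an in-memory buffer, assuming NumPy is installed; it does not show WebDataset's own decoding code.

```python
import io

import numpy as np

# Save an object array; this requires pickling under the hood.
buffer = io.BytesIO()
np.save(buffer, np.array([{"a": 1}], dtype=object), allow_pickle=True)

# Loading with allow_pickle=False refuses object arrays.
buffer.seek(0)
try:
    np.load(buffer, allow_pickle=False)
    refused = False
except ValueError:
    refused = True

# Loading the same bytes with allow_pickle=True succeeds, but it will
# unpickle whatever the file contains, so only do this for trusted data.
buffer.seek(0)
loaded = np.load(buffer, allow_pickle=True)
```

Plain numeric arrays are unaffected by the setting; only arrays with `dtype=object` need pickling.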
Sorry to have been sitting on this for so long. The `repeat=` argument doesn't use `itertools.cycle`, it simply sets `self.repetitions`; the main iterator is this: ``` def iterator(self): """Create an...
LMDBCached wasn't really used in that way in the past (it's very rarely used). But you're right, that should get fixed.
Unbatched is the inverse of batched. Batched should never produce this kind of output. What it should produce in this case is something like: ``` sample = { 'image': torch.rand(2,...
Thanks; that looks like it's unintentional, I'll have a look.