Tom comments

Results: 206 comments of Tom

How to prevent shuffling due to num_workers != 0 ? Can WebLoader objects be aggregated?

I'm not sure I understand the structure of the dataset. In general, WebDataset does not randomize the order of your shards. Of course, if you use multiple workers with DataLoader,...

How to implement batch sampler on webdataset?

PyTorch has two fundamentally different forms of datasets: indexed datasets, and iterable datasets. Both are recognized by DataLoader but are treated very differently by the PyTorch library. This is just...
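
To make the distinction concrete, here is a minimal plain-Python sketch of the two protocols. It deliberately avoids importing PyTorch; in PyTorch the two forms correspond to `torch.utils.data.Dataset` (indexed) and `torch.utils.data.IterableDataset` (iterable). The class names `IndexedSquares` and `StreamedSquares` and the helper `batched` are hypothetical and exist only for illustration; they are not part of WebDataset or PyTorch.

```python
from itertools import islice


class IndexedSquares:
    """Indexed (map-style) dataset: random access by integer index."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return i * i


class StreamedSquares:
    """Iterable dataset: samples can only be consumed in stream order."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield i * i


def batched(stream, batch_size):
    """Group a stream into lists of up to batch_size consecutive samples."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# A sampler can pick arbitrary indices from an indexed dataset...
indexed = IndexedSquares(6)
sampled_batch = [indexed[i] for i in (4, 0, 2)]

# ...whereas a stream is batched in the order the samples arrive.
stream_batches = list(batched(StreamedSquares(6), 4))
```

A batch sampler works by choosing indices, so it only has something to act on in the indexed case. With a stream, batching has to happen inside the pipeline, as `batched` does here.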

Broken pipe errors when using `pipe:`-wrapped S3 URLs — workaround by custom gopen scheme implementation

Note that you have the option of using a small shell script or compound command, something like this: ``` s3cmd cp s3://... /tmp/$$ && cat /tmp/$$ && rm -f /tmp/$$...

Broken pipe errors when using `pipe:`-wrapped S3 URLs — workaround by custom gopen scheme implementation

> Could it be that this approach also has better memory usage, since it streams the byte content instead of loading everything onto the machine?

The command-line based I/O is...

Possibly reading from tarfiles before file is streamed from S3?

Sorry for the late response. The best way of dealing with unreliable storage or S3 connections is to write a script that retries until it correctly retrieves the file. You...
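
As a rough illustration of the retry idea, here is a minimal Python sketch. The function name `fetch_with_retries`, its parameters, and the exponential backoff are assumptions made for this example; nothing here is part of WebDataset, and a shell script that loops over the download command would serve the same purpose.

```python
import time


def fetch_with_retries(fetch, attempts=5, delay=1.0):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is any zero-argument callable that returns the retrieved data
    and raises OSError on a failed or incomplete transfer (an assumption
    of this sketch).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as exc:
            last_error = exc
            # Back off before the next try, but not after the final one.
            if attempt < attempts - 1:
                time.sleep(delay * 2 ** attempt)
    raise RuntimeError(f"giving up after {attempts} attempts") from last_error
```

The caller supplies `fetch`, for example a function that downloads one shard to a temporary file, checks that it is complete, and returns its path. Only a fully retrieved file is then handed on to the reader.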

How to decode numpy arrays that require allow_pickle=True

WebDataset uses allow_pickle=False by default for .npz and .npy extensions. That is the safer and more common default; it's generally not such a good idea to put objects...

Memory leak during training with standard DataLoader coupled with WebDataset dataloader

Sorry to have been sitting on this for so long. The `repeat=` argument doesn't use `itertools.cycle`, it simply sets `self.repetitions`; the main iterator is this: ``` def iterator(self): """Create an...

lmdb_cached method & multiple dataloader workers

LMDBCached wasn't really used in that way in the past (it's very rarely used). But you're right, that should get fixed.

collated nested dicts are not supported in `wds.filters._unbatched`

Unbatched is the inverse of batched. Batched should never produce this kind of output. What it should produce in this case is something like: ``` sample = { 'image': torch.rand(2,...

cache_dir behavior changed from 0.2.88

Thanks; that looks like it's unintentional, I'll have a look.