Jordan Totten
@tmbdev in the suggested [link](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) from the error message, it illustrates using a `worker_init_fn` like this:

```
# Define a `worker_init_fn` that configures each dataset copy differently
def worker_init_fn(worker_id):
    worker_info =...
```
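For reference, here is a fuller sketch of that pattern, assuming an `IterableDataset` that exposes `start` and `end` attributes for its overall range (as in the linked docs example); adapt the attribute names to your dataset:

```
import math
import torch

# Sketch of the pattern from the linked PyTorch docs: give each worker a
# disjoint [start, end) slice of the dataset so samples are not duplicated.
def worker_init_fn(worker_id):
    worker_info = torch.utils.data.get_worker_info()
    dataset = worker_info.dataset          # the dataset copy in this worker process
    overall_start = dataset.start          # assumes the dataset defines start/end
    overall_end = dataset.end
    per_worker = int(math.ceil((overall_end - overall_start) / float(worker_info.num_workers)))
    dataset.start = overall_start + worker_info.id * per_worker
    dataset.end = min(dataset.start + per_worker, overall_end)
```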
Thanks @tmbdev, this was very helpful. I would like to try recreating epochs first. If that doesn't work out I will look at other options. I tried not specifying `length=...`...
Following config:
- `length=epoch_size`
- `batch_size=128`
- `num_workers=4`

Results in the following:
- No IterableDataset length mismatch warnings
- Still getting BrokenPipe errors after the last epoch
- Training loss...
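For context, this is roughly the loader setup being described, as a sketch only: the shard URLs and preprocessing are placeholders, and `.with_epoch()` is the newer webdataset spelling of what the old `length=` argument did, so it may differ from the exact version used here:

```
import webdataset as wds
import torchvision.transforms as T

epoch_size = 10_000   # assumed value; nominal number of samples per epoch
shard_urls = "pipe:gsutil cat gs://my-bucket/train-{0000..0099}.tar"  # placeholder shards

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])

# Sketch of a WebDataset pipeline with a fixed nominal epoch length.
dataset = (
    wds.WebDataset(shard_urls)
    .shuffle(1000)
    .decode("pil")
    .to_tuple("jpg", "cls")
    .map_tuple(preprocess, lambda y: y)
    .with_epoch(epoch_size)   # newer API; older releases passed length=epoch_size instead
)

loader = wds.WebLoader(dataset, batch_size=128, num_workers=4)
```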
You are correct that the "BrokenPipe Error" is being ignored and not causing the errors at the end (e.g., the DataLoader gracefully exiting).
- As I experimented with larger values for...
Thank you for that @tmbdev! I'm going to test different configurations to better understand what's going on after the last epoch. As you pointed out, the BrokenPipe error is...
Actually I'm confused about the Driver Version. You see the results from `nvidia-smi` above. However, when I run `env | grep CUDA` I get the following:

```
env | grep CUDA...
```
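A quick, generic way to compare what the framework was built against versus the driver that `nvidia-smi` reports (not specific to this container, just a sanity check):

```
import torch

# The driver version from `nvidia-smi` is independent of the CUDA toolkit
# version PyTorch was compiled against; printing both sides helps spot mismatches.
print("torch version:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```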
I'm not sure why this makes a difference, but here is what I've done to get around this error and start model training: They use `command` and `args` differently. This...
Hey @jperez999 - This should be the container. I think because this is running with bash it shouldn't be a driver problem, right? I think it has to do with...
thanks @rnyak
@megaserg I am trying to do something similar with TPU Pods (multi-machine): PyTorch XLA reading an ImageNet Petastorm dataset from GCS buckets. Did you find that one of your options worked...
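In case it helps frame the question, this is roughly the pattern being attempted, as a sketch only: the bucket path and field names are placeholders, and it assumes Petastorm can resolve `gs://` URLs in this environment and that `torch_xla` is installed:

```
from petastorm import make_reader
from petastorm.pytorch import DataLoader
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # the TPU core assigned to this process

# Placeholder path to a Petastorm-materialized ImageNet copy on GCS.
dataset_url = "gs://my-bucket/imagenet_petastorm"

# make_reader streams row groups; petastorm's DataLoader batches them as tensors.
with make_reader(dataset_url, num_epochs=1) as reader:
    loader = DataLoader(reader, batch_size=128)
    for batch in loader:
        images = batch["image"].to(device)   # field names depend on the Unischema
        labels = batch["label"].to(device)
        # ... forward/backward pass goes here ...
```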