Jordan Totten
@tmbdev in the suggested [link](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) from the error message, it illustrates using a `worker_init_fn` like this:

```
# Define a `worker_init_fn` that configures each dataset copy differently
def worker_init_fn(worker_id):
    worker_info =...
```
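For reference, here is a fuller sketch of that pattern, assuming an `IterableDataset` that exposes `start` and `end` attributes for its overall range (as in the linked docs example); adapt the attribute names to your dataset:

```
import math
import torch

# Sketch of the pattern from the linked PyTorch docs: give each worker a
# disjoint [start, end) slice of the dataset so samples are not duplicated.
def worker_init_fn(worker_id):
    worker_info = torch.utils.data.get_worker_info()
    dataset = worker_info.dataset          # the dataset copy in this worker process
    overall_start = dataset.start          # assumes the dataset defines start/end
    overall_end = dataset.end
    per_worker = int(math.ceil((overall_end - overall_start) / float(worker_info.num_workers)))
    dataset.start = overall_start + worker_info.id * per_worker
    dataset.end = min(dataset.start + per_worker, overall_end)
```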
Thanks @tmbdev, this was very helpful. I would like to try recreating epochs first. If that doesn't work out I will look at other options. I tried not specifying `length=...`...
Following config:
- `length=epoch_size`
- `batch_size=128`
- `num_workers=4`

Results in the following:
- No IterableDataset length mismatch warnings
- Still getting BrokenPipe errors after the last epoch
- Training loss...
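For context, this is roughly the loader setup being described, as a sketch only: the shard URLs and preprocessing are placeholders, and `.with_epoch()` is the newer webdataset spelling of what the old `length=` argument did, so it may differ from the exact version used here:

```
import webdataset as wds
import torchvision.transforms as T

epoch_size = 10_000   # assumed value; nominal number of samples per epoch
shard_urls = "pipe:gsutil cat gs://my-bucket/train-{0000..0099}.tar"  # placeholder shards

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])

# Sketch of a WebDataset pipeline with a fixed nominal epoch length.
dataset = (
    wds.WebDataset(shard_urls)
    .shuffle(1000)
    .decode("pil")
    .to_tuple("jpg", "cls")
    .map_tuple(preprocess, lambda y: y)
    .with_epoch(epoch_size)   # newer API; older releases passed length=epoch_size instead
)

loader = wds.WebLoader(dataset, batch_size=128, num_workers=4)
```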
You are correct that the "BrokenPipe Error" is being ignored and not causing the errors at the end (e.g., the DataLoader gracefully exiting).
- As I experimented with larger values for...
Thank you for that @tmbdev! I'm going to test different configurations to better understand what's going on after the last epoch. As you pointed out, the BrokenPipe error is...
Actually I'm confused about the Driver Version. You see the results from `nvidia-smi` above. However, when I run `env | grep CUDA` I get the following:

```
env | grep CUDA...
```
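A quick, generic way to compare what the framework was built against versus the driver that `nvidia-smi` reports (not specific to this container, just a sanity check):

```
import torch

# The driver version from `nvidia-smi` is independent of the CUDA toolkit
# version PyTorch was compiled against; printing both sides helps spot mismatches.
print("torch version:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```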
I'm not sure why this makes a difference, but here is what I've done to get around this error and start model training: They use `command` and `args` differently. This...
Hey @jperez999 - This should be the container. I think because this is running with bash it shouldn't be a driver problem, right? I think it has to do with...
thanks @rnyak
@megaserg I am trying to do something similar with TPU Pods (multi-machine): PyTorch XLA reading an ImageNet Petastorm dataset from GCS buckets. Did you find that one of your options worked...
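In case it helps frame the question, this is roughly the pattern being attempted, as a sketch only: the bucket path and field names are placeholders, and it assumes Petastorm can resolve `gs://` URLs in this environment and that `torch_xla` is installed:

```
from petastorm import make_reader
from petastorm.pytorch import DataLoader
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # the TPU core assigned to this process

# Placeholder path to a Petastorm-materialized ImageNet copy on GCS.
dataset_url = "gs://my-bucket/imagenet_petastorm"

# make_reader streams row groups; petastorm's DataLoader batches them as tensors.
with make_reader(dataset_url, num_epochs=1) as reader:
    loader = DataLoader(reader, batch_size=128)
    for batch in loader:
        images = batch["image"].to(device)   # field names depend on the Unischema
        labels = batch["label"].to(device)
        # ... forward/backward pass goes here ...
```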