FixMatch-pytorch

Error using single gpu for training

Open Adnan-Khan7 opened this issue 3 years ago • 5 comments

Thanks for the work you have done.

I encounter the following error when training on a single GPU:
ValueError: num_samples should be a positive integer value, but got num_samples=-67108864

The command I am using is: python train.py --rank 0 --gpu 0

Can you please assist?

Thanks

Adnan-Khan7 avatar Nov 09 '22 18:11 Adnan-Khan7

Hello @Adnan-Khan7, could you share the error details, such as the full traceback and the line of code that fails?

LeeDoYup avatar Nov 09 '22 18:11 LeeDoYup

Sure, please have a look at the traceback:

Traceback (most recent call last):
  File "train.py", line 319, in <module>
    main(args)
  File "train.py", line 67, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "train.py", line 194, in main_worker
    loader_dict['train_lb'] = get_data_loader(dset_dict['train_lb'],
  File "/home/adnan.khan/FixMatch-pytorch/datasets/data_utils.py", line 120, in get_data_loader
    data_sampler = data_sampler(dset, replacement, num_samples, generator)
  File "/home/adnan.khan/.conda/envs/fixmatch/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 107, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=-67108864

Adnan-Khan7 avatar Nov 09 '22 18:11 Adnan-Khan7

Have you changed any default arguments? I ask because there is no logic in the code that should make num_samples negative.

Do you get the same error with the command python train.py --world-size 1 --rank 0 ?

LeeDoYup avatar Nov 09 '22 19:11 LeeDoYup
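
A note on the number itself: -67108864 is exactly -(2**20 * 64). The thread does not show how get_data_loader computes num_samples, so the following is only a sketch under stated assumptions: that num_samples is derived as iterations * batch size // world size, that the iteration count is 2**20 and the batch size is 64, and that the world size was -1 when --world-size was not passed. All names and values below are hypothetical and are not taken from the repository; they only illustrate how such a division could yield the reported value.

```python
# Hypothetical illustration: none of these names or defaults are confirmed
# by the thread; they are assumptions chosen because they reproduce the value.

def sampler_num_samples(num_iters: int, batch_size: int, world_size: int) -> int:
    """Assumed formula: total samples to draw, split across processes."""
    return num_iters * batch_size // world_size


NUM_ITERS = 2 ** 20   # assumed number of training iterations
BATCH_SIZE = 64       # assumed labeled batch size

# With an explicit world size of 1 the result is positive.
positive = sampler_num_samples(NUM_ITERS, BATCH_SIZE, world_size=1)

# With a placeholder world size of -1 the sign flips, matching the error message.
negative = sampler_num_samples(NUM_ITERS, BATCH_SIZE, world_size=-1)

print(positive)
print(negative)
```

If this assumption holds, passing --world-size 1 explicitly would be expected to remove the negative value, which is consistent with the suggestion above.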

I didn't change any other default arguments. Adding --world-size 1 now raises a ZeroDivisionError instead. This is the command I am running:

python train.py --world-size 1 --rank 0 --overwrite

train.py:40: UserWarning: You have chosen to seed training. This will turn on the CUDNN deterministic setting, which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
  warnings.warn('You have chosen to seed training. '
Traceback (most recent call last):
  File "train.py", line 319, in <module>
    main(args)
  File "train.py", line 67, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "train.py", line 102, in main_worker
    if args.rank % ngpus_per_node == 0:
ZeroDivisionError: integer division or modulo by zero

Adding --gpu 0, i.e. running python train.py --world-size 1 --rank 0 --gpu 0 --overwrite, produces the same error, but with an additional warning:

train.py:40: UserWarning: You have chosen to seed training. This will turn on the CUDNN deterministic setting, which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
  warnings.warn('You have chosen to seed training. '
train.py:47: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism.
  warnings.warn('You have chosen a specific GPU. This will completely '
Traceback (most recent call last):
  File "train.py", line 319, in <module>
    main(args)
  File "train.py", line 67, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "train.py", line 102, in main_worker
    if args.rank % ngpus_per_node == 0:
ZeroDivisionError: integer division or modulo by zero

Adnan-Khan7 avatar Nov 09 '22 19:11 Adnan-Khan7
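
A note on the traceback above: a modulo by zero at `args.rank % ngpus_per_node` means ngpus_per_node was 0 when line 102 ran. The thread does not show how train.py computes that value. In scripts patterned after the PyTorch ImageNet example it comes from torch.cuda.device_count(), which returns 0 when no GPU is visible to the process, but that is an assumption here. The sketch below uses a hypothetical helper, is_node_primary, that is not part of the repository, to show the failing check with a guard that reports the likely cause instead of a bare ZeroDivisionError.

```python
# Hypothetical helper for illustration; not part of FixMatch-pytorch.

def is_node_primary(rank: int, ngpus_per_node: int) -> bool:
    """Same test as `args.rank % ngpus_per_node == 0`, guarded against zero GPUs."""
    if ngpus_per_node <= 0:
        raise RuntimeError(
            f"ngpus_per_node={ngpus_per_node}: no GPU appears to be visible. "
            "Check CUDA_VISIBLE_DEVICES and the CUDA/PyTorch installation."
        )
    return rank % ngpus_per_node == 0


# The unguarded expression fails exactly as in the traceback.
try:
    0 % 0
    unguarded_error = None
except ZeroDivisionError as exc:
    unguarded_error = str(exc)

# The guarded version succeeds for one GPU and explains the zero-GPU case.
print(is_node_primary(rank=0, ngpus_per_node=1))
try:
    is_node_primary(rank=0, ngpus_per_node=0)
except RuntimeError as exc:
    print(exc)
```

A quick way to test the assumption is to run python -c "import torch; print(torch.cuda.device_count())" in the same environment; an output of 0 would point to a driver, CUDA build, or CUDA_VISIBLE_DEVICES problem rather than to the training script.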

Dear Lee, any comments on the above-stated error?

Adnan-Khan7 avatar Nov 13 '22 17:11 Adnan-Khan7