RetroExplainer

ct_prob and batch[ct_target] sizes do not match

Open starhou opened this issue 2 years ago • 9 comments

```
raise ValueError(f"Target size ({target.size()}) must be the same as input size ({input.size()})")
ValueError: Target size (torch.Size([256, 300, 10])) must be the same as input size (torch.Size([256, 300, 4]))
```

The parameters I used were:

```
python entry.py --batch_size 512 --acc_batches 1 --d_model 128 --dim_feedforward 256 --gpus 2 --epochs 2000 --dropout 0.2 --warmup_updates 2000 --tot_updates 1000000 --dataset data/USPTO50K --known_rxn_type --norm_first --nhead 32 --num_shared_layer 6 --num_rc_layer 0 --num_lg_layer 6 --num_ct_layer 6 --num_h_layer 6 --seed 123 --cuda 2 --max_ct_atom 4 --max_single_hop 4
```

Printing the two tensors confirms the shapes really do differ:

```
ct_prob          --> torch.Size([256, 300, 4])
batch[ct_target] --> torch.Size([256, 300, 10])
```
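For context, this ValueError is the standard shape check in PyTorch's binary cross-entropy loss. A plain-Python stand-in for that check (shapes passed as tuples instead of `torch.Size`, so no torch install is needed) reproduces the failure with the shapes from the report:

```python
# Minimal stand-in for the shape check that raises this ValueError; it mirrors
# the behaviour of torch.nn.functional.binary_cross_entropy_with_logits.
def bce_shape_check(input_size, target_size):
    if input_size != target_size:
        raise ValueError(
            f"Target size ({target_size}) must be the same as input size ({input_size})"
        )

# Shapes from the report: ct_prob is (256, 300, 4) but
# batch["ct_target"] is (256, 300, 10) -> the check fails.
try:
    bce_shape_check((256, 300, 4), (256, 300, 10))
except ValueError as err:
    print(err)
```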

starhou avatar Dec 15 '23 08:12 starhou

Hi, may I ask how you solved this problem?

yzlbig avatar Mar 15 '24 13:03 yzlbig

I ran into the same problem. I would really appreciate it if you could share your fix. Many thanks!

yzlbig avatar Mar 15 '24 13:03 yzlbig

(screenshot) Just change this default parameter from 4 to 10.

Long-Nicholas avatar Sep 24 '24 16:09 Long-Nicholas
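A hedged sketch of the kind of change the screenshot shows. The argument name `--max_ct_atom` is taken from the command line in the original report; which file and flag the screenshot actually points at is an assumption here:

```python
import argparse

# Hypothetical reconstruction of the fix: raise the default for the ct-head
# size from 4 to 10 so ct_prob's last dimension matches batch["ct_target"].
parser = argparse.ArgumentParser()
parser.add_argument("--max_ct_atom", type=int, default=10)  # was default=4

args = parser.parse_args([])  # no CLI override, so the new default applies
print(args.max_ct_atom)  # -> 10
```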

> (screenshot) Just change this default parameter from 4 to 10.

Thanks a lot! But why do my runs keep failing with an out-of-memory error? My machine has 250 GB of RAM; is that still not enough?

Also, have you managed to train successfully?

feiyang-cai avatar Sep 25 '24 03:09 feiyang-cai

> (screenshot) Just change this default parameter from 4 to 10.
>
> Thanks a lot! But why do my runs keep failing with an out-of-memory error? My machine has 250 GB of RAM; is that still not enough?
>
> Also, have you managed to train successfully?

Could it be that GPU memory is insufficient? (I guess)

I have not trained successfully yet; the source code seems to have other problems.

Trying to fix it 😔

Long-Nicholas avatar Sep 25 '24 03:09 Long-Nicholas

> (screenshot) Just change this default parameter from 4 to 10.
>
> Thanks a lot! But why do my runs keep failing with an out-of-memory error? My machine has 250 GB of RAM; is that still not enough? Also, have you managed to train successfully?
>
> Could it be that GPU memory is insufficient? (I guess)
>
> I have not trained successfully yet; the source code seems to have other problems.
>
> Trying to fix it 😔

It ran out of RAM because the number of workers was too large. The run passed once I set the number of workers to 1.
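A rough, assumed memory model of why lowering the worker count helps (not measured on this code base): each DataLoader worker is a separate process that can end up holding its own copy of the dataset object, so resident RAM grows roughly linearly with the worker count.

```python
# Assumed linear model: base process memory plus one dataset copy per worker.
def approx_ram_gb(base_gb, per_worker_gb, num_workers):
    return base_gb + per_worker_gb * num_workers

# Illustrative (made-up) numbers: with a large in-memory dataset, several
# workers can blow past 250 GB, while a single worker stays modest.
print(approx_ram_gb(base_gb=40.0, per_worker_gb=40.0, num_workers=8))
print(approx_ram_gb(base_gb=40.0, per_worker_gb=40.0, num_workers=1))
```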

Another problem is that the batch size is too large: I cannot run with a batch size of 512 on my A100 (80 GB VRAM) machine.

I set the batch size to 64 and acc_batches to 8, and then the code runs.

It has already been running for 9 hours but has only reached epoch 14 so far...
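The batch_size/acc_batches split above keeps the effective batch size of the original command, assuming acc_batches does standard gradient accumulation (accumulate that many micro-batches before each optimizer step):

```python
# Gradient accumulation trades GPU memory for extra forward/backward passes:
# effective batch size = per-step batch size * accumulated micro-batches.
batch_size = 64
acc_batches = 8
effective_batch = batch_size * acc_batches
print(effective_batch)  # -> 512, matching the original --batch_size 512 run
```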

feiyang-cai avatar Sep 25 '24 13:09 feiyang-cai

> (screenshot) Just change this default parameter from 4 to 10.
>
> Thanks a lot! But why do my runs keep failing with an out-of-memory error? My machine has 250 GB of RAM; is that still not enough? Also, have you managed to train successfully?
>
> Could it be that GPU memory is insufficient? (I guess) I have not trained successfully yet; the source code seems to have other problems. Trying to fix it 😔
>
> It ran out of RAM because the number of workers was too large. The run passed once I set the number of workers to 1.
>
> Another problem is that the batch size is too large: I cannot run with a batch size of 512 on my A100 (80 GB VRAM) machine.
>
> I set the batch size to 64 and acc_batches to 8, and then the code runs.
>
> It has already been running for 9 hours but has only reached epoch 14 so far...

May I ask how long the training took you in the end? My run also took 9 hours to reach epoch 14... I likewise reduced batch_size to 64; my card is an A6000.

lanna0504 avatar Apr 23 '25 06:04 lanna0504

Sorry, it has been too long and I can't remember clearly. I probably never finished training. I just checked the training script: it specifies 2000 epochs, and since a dozen or so epochs already took about 10 hours, I believe I gave up at that point. I don't think it was a machine problem; the code base likely has some bugs.

feiyang-cai avatar Apr 23 '25 11:04 feiyang-cai

> Sorry, it has been too long and I can't remember clearly. I probably never finished training. I just checked the training script: it specifies 2000 epochs, and since a dozen or so epochs already took about 10 hours, I believe I gave up at that point. I don't think it was a machine problem; the code base likely has some bugs.

Oh, okay. Thank you!

lanna0504 avatar Apr 23 '25 14:04 lanna0504