RetroExplainer: size mismatch between ct_prob and batch[ct_target]

raise ValueError(f"Target size ({target.size()}) must be the same as input size ({input.size()})")
ValueError: Target size (torch.Size([256, 300, 10])) must be the same as input size (torch.Size([256, 300, 4]))

The arguments I used were:

python entry.py --batch_size 512 --acc_batches 1 --d_model 128 --dim_feedforward 256 --gpus 2 --epochs 2000 --dropout 0.2 --warmup_updates 2000 --tot_updates 1000000 --dataset data/USPTO50K --known_rxn_type --norm_first --nhead 32 --num_shared_layer 6 --num_rc_layer 0 --num_lg_layer 6 --num_ct_layer 6 --num_h_layer 6 --seed 123 --cuda 2 --max_ct_atom 4 --max_single_hop 4

Printing the two tensors confirms that their shapes really do differ:

ct_prob--> torch.Size([256, 300, 4])
batch[ct_target]--> torch.Size([256, 300, 10])

Dec 15 '23 08:12 starhou

Hi, how did you solve this problem?

Mar 15 '24 13:03 yzlbig

I ran into the same problem. I hope you can share the solution. Many thanks!

Mar 15 '24 13:03 yzlbig

Just change this default parameter from 4 to 10.
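The traceback shows that only the last dimension disagrees: ct_prob has width 4 while batch[ct_target] has width 10. A minimal pre-flight check, sketched below in plain Python, compares the two shapes and reports the width the model side would need. The function name `check_last_dim` and the guess that the relevant setting is `--max_ct_atom` are assumptions for illustration; the thread does not name the parameter.

```python
from typing import Optional, Tuple

Shape = Tuple[int, ...]


def check_last_dim(pred_shape: Shape, target_shape: Shape,
                   flag_name: str = "--max_ct_atom") -> Optional[str]:
    """Return None if the shapes match, otherwise a human-readable hint.

    flag_name is an assumption: the thread only says "this default parameter".
    """
    if pred_shape == target_shape:
        return None
    # Same leading dimensions, different trailing width: the case in the issue.
    if pred_shape and target_shape and pred_shape[:-1] == target_shape[:-1]:
        return (f"last dimension differs: prediction has {pred_shape[-1]}, "
                f"target has {target_shape[-1]}; "
                f"try {flag_name} {target_shape[-1]}")
    return f"shapes differ beyond the last dimension: {pred_shape} vs {target_shape}"


# Shapes reported in the issue.
hint = check_last_dim((256, 300, 4), (256, 300, 10))
print(hint)
```

Running a check like this on the first batch, before the loss is computed, would surface the mismatch with a more direct message than the ValueError above.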

Sep 24 '24 16:09 Long-Nicholas

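Thank you very much! But why do I keep getting an out of memory error when I run it? I am using a machine with 250 GB of RAM; is that still not enough?

Also, have you been able to train successfully?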

Sep 25 '24 03:09 feiyang-cai

Could it be that GPU memory is insufficient? (Just a guess.)

I have not trained successfully yet; there seem to be other problems in the source code.

I am trying to fix it now 😔

Sep 25 '24 03:09 Long-Nicholas

It was running out of RAM because the number of workers was too large. It passed once I set the number of workers to 1.

Another problem is that the batch size is too large: I cannot run with a batch size of 512 on my A100 (80 GB VRAM) machine.

I set the batch size to 64 and acc_batches to 8, and then the code runs.
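Batch size 64 with acc_batches 8 gives the same number of samples per optimizer step as the original 512 (64 × 8 = 512), assuming acc_batches is the number of gradient accumulation steps. The pure-Python sketch below uses a hypothetical one-parameter least-squares model and made-up data to illustrate why this works: averaging the gradients of equal-sized micro-batches reproduces the full-batch gradient.

```python
import math
from typing import Sequence


def grad_mse(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: Sequence[float], ys: Sequence[float],
                     micro_batch: int, acc_batches: int) -> float:
    """Average the gradients of acc_batches equal-sized micro-batches."""
    total = 0.0
    for i in range(acc_batches):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        # Dividing by acc_batches mirrors scaling each micro-batch loss.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / acc_batches
    return total


# Made-up data: 512 samples, i.e. one "large" batch.
xs = [0.01 * i for i in range(512)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

full = grad_mse(w, xs, ys)                                    # batch size 512
accum = accumulated_grad(w, xs, ys, micro_batch=64, acc_batches=8)
print(full, accum)
```

Peak memory per step follows the micro-batch of 64, while the optimizer sees an update equivalent to a batch of 512. Whether the training script scales the loss in exactly this way is not confirmed in this thread.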

It has already been running for 9 hours but has completed only 14 epochs so far...

Sep 25 '24 13:09 feiyang-cai

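How long did it take you to train the model in the end? I also ran for 9 hours and got through only 14 epochs... I also changed batch_size to 64, and my GPU is an A6000.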

Apr 23 '25 06:04 lanna0504

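Sorry, it has been too long and I cannot remember clearly. I don't think I finished the run. I just looked at the training script, which specifies 2000 epochs; my guess is that after about 10 hours of training had produced only a dozen or so epochs, I gave up. I don't think it was a hardware problem; more likely the code base has some bugs.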
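For scale, the figures reported in this thread (14 epochs in 9 hours, 2000 epochs configured) can be extrapolated linearly. This is a rough sketch that assumes every epoch takes the same time; the helper name `estimate_total_hours` is made up for illustration.

```python
def estimate_total_hours(epochs_done: int, hours_elapsed: float,
                         total_epochs: int) -> float:
    """Linear extrapolation of wall-clock time from the epochs finished so far."""
    if epochs_done <= 0:
        raise ValueError("epochs_done must be positive")
    return hours_elapsed / epochs_done * total_epochs


# Numbers taken from the thread: 14 epochs in 9 hours, --epochs 2000.
total_hours = estimate_total_hours(epochs_done=14, hours_elapsed=9.0,
                                   total_epochs=2000)
print(f"{total_hours:.0f} hours, about {total_hours / 24:.0f} days")
```

At that rate the full 2000 epochs would take roughly 1286 hours, or about 54 days, which is consistent with the run being stopped early.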

Apr 23 '25 11:04 feiyang-cai

Oh, okay, thank you.

Apr 23 '25 14:04 lanna0504