When training on Tesla V100s, for example, training on the VG dataset can be fed 12 images at a time, but during validation each card appears to handle only one image at a time. Is there any way to validate 12 images at once during validation?
Training .sh:

```bash
python tools/relation_train_net.py \
    --config-file "configs/e2e_relBGNN_vg.yaml" \
    DEBUG False \
    EXPERIMENT_NAME "BGNN-PreCls" \
    SOLVER.IMS_PER_BATCH $[3*4] \
    TEST.IMS_PER_BATCH $[4] \
    SOLVER.VAL_PERIOD 3000 \
    SOLVER.CHECKPOINT_PERIOD 3000 \
    MODEL.ROI_RELATION_HEAD.USE_GT_BOX True \
    MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL True \
```
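For reference, the per-card image counts implied by the script can be worked out as below, under the assumption that this run uses 4 GPUs and that pysgg splits the global batch size evenly across GPUs (the maskrcnn-benchmark convention it derives from); this is a sketch of the arithmetic, not a statement about the repository's internals.

```python
# A minimal sketch of the per-GPU batch arithmetic, assuming (a) 4 V100 cards
# and (b) that pysgg divides the global batch size evenly across GPUs, as
# maskrcnn-benchmark-derived codebases typically do.
num_gpus = 4                                 # assumption for this run

train_ims_per_batch = 3 * 4                  # SOLVER.IMS_PER_BATCH in the script
test_ims_per_batch = 4                       # TEST.IMS_PER_BATCH in the script

train_ims_per_gpu = train_ims_per_batch // num_gpus   # 12 // 4 = 3 images per card
test_ims_per_gpu = test_ims_per_batch // num_gpus     # 4 // 4 = 1 image per card

print(train_ims_per_gpu, test_ims_per_gpu)   # -> 3 1
```

Under that assumption, the training batch of 12 gives 3 images per card while the validation batch of 4 gives exactly 1 image per card, so raising TEST.IMS_PER_BATCH to a larger multiple of the GPU count (e.g. 12) would be the setting to try, memory permitting.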
Problem encountered:
```
instance name: sgdet-BGNNPredictor/(2022-07-01_13)BGNN-PreCls(resampling)
elapsed time: 0:06:51
eta: 3 days, 7:48:18
iter: 100/70000
loss: 0.6129 (0.7214)
loss_rel: 0.1183 (0.1323)
pre_rel_classify_loss_iter-0: 0.1641 (0.2069)
pre_rel_classify_loss_iter-1: 0.1628 (0.1891)
pre_rel_classify_loss_iter-2: 0.1618 (0.1932)
time: 3.9448 (4.1101)
data: 0.0559 (0.0689)
lr: 0.026707
max mem: 19994
[07/01 13:31:28 pysgg]: relness module pretraining..
[07/01 13:31:28 pysgg]: Start validating
[07/01 13:31:28 pysgg]: Start evaluation on VG_stanford_filtered_with_attribute_val dataset(5000 images).
0%| | 0/417 [00:06<?, ?it/s]
Traceback (most recent call last):
  File "tools/relation_train_net.py", line 714, in <module>
    main()
  File "tools/relation_train_net.py", line 705, in main
    model = train(cfg, args.local_rank, args.distributed, logger)
  File "tools/relation_train_net.py", line 496, in train
    val_result = run_val(cfg, model, val_data_loaders, distributed, logger)
  File "tools/relation_train_net.py", line 565, in run_val
    logger=logger,
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/engine/inference.py", line 123, in inference
    timer=inference_timer, logger=logger)
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/engine/inference.py", line 41, in compute_on_dataset
    output = model(images.to(device), targets, logger=logger)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/modeling/detector/generalized_rcnn.py", line 52, in forward
    x, result, detector_losses = self.roi_heads(features, proposals, targets, logger)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/modeling/roi_heads/roi_heads.py", line 69, in forward
    x, detections, loss_relation = self.relation(features, detections, targets, logger)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/modeling/roi_heads/relation_head/relation_head.py", line 215, in forward
    logger,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/modeling/roi_heads/relation_head/roi_relation_predictors.py", line 604, in forward
    roi_features, union_features, inst_proposals, rel_pair_idxs, rel_binarys, logger
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/modeling/roi_heads/relation_head/model_bgnn.py", line 796, in forward
    rel_pair_inds,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lintianlin_group_v100/lgzhou/scene_graph_generation/bgnn/pysgg/modeling/roi_heads/relation_head/model_msg_passing.py", line 261, in forward
    obj_embed_by_pred_dist = self.obj_embed_on_prob_dist(obj_labels.long())
AttributeError: 'NoneType' object has no attribute 'long'
```
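The traceback ends in model_msg_passing.py, where obj_labels is None when .long() is called, i.e. the proposals reaching the relation head carry no ground-truth label field at validation time (note that the instance name reports sgdet-BGNNPredictor even though the script sets USE_GT_BOX and USE_GT_OBJECT_LABEL to True). Below is a minimal, hypothetical sketch of the kind of guard that avoids this crash; embed_objects, obj_logits, and the soft-embedding fallback are illustrative assumptions, not the repository's actual code or fix.

```python
import torch
import torch.nn.functional as F

# Hypothetical guard around the failing call in model_msg_passing.py.
# Assumption: when ground-truth labels are unavailable (e.g. sgdet-style
# validation), the embedding can fall back to a soft lookup weighted by the
# predicted class distribution instead of indexing with obj_labels.long().
def embed_objects(obj_embed_on_prob_dist, obj_labels, obj_logits):
    if obj_labels is not None:
        # PredCls-style path: ground-truth labels are attached to the proposals.
        return obj_embed_on_prob_dist(obj_labels.long())
    # No GT labels: weight the embedding table by predicted class probabilities.
    obj_probs = F.softmax(obj_logits, dim=-1)             # [N, num_classes]
    return obj_probs @ obj_embed_on_prob_dist.weight      # [N, embed_dim]

# Example with made-up sizes: 5 proposals, 151 VG object classes, 200-d embeddings.
embed_table = torch.nn.Embedding(151, 200)
logits = torch.randn(5, 151)
feat = embed_objects(embed_table, None, logits)           # soft fallback, no crash
```

A soft lookup over the predicted class distribution is a common pattern in scene graph generation codebases when GT labels are unavailable; whether it is the right fix here depends on why the run is in sgdet mode despite the PredCls flags in the script.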