mx-DeepIM

training log

Open vshk42 opened this issue 6 years ago • 2 comments

Hi,

Would you be able to provide the training log? I'm especially interested in the point matching loss after a few epochs. I'm training on a Google Compute Engine VM with 4 NVIDIA Tesla K80 GPUs, but the training speed is only about 2.2 frames/sec. The point matching loss has stayed around 10-11 for the last few hours (although the flow loss and the mask loss are showing a good downward trend).

Epoch[3] Batch [260] Speed: 2.26 samples/sec Train-Flow_L2Loss=0.321852, Flow_CurLoss=0.000000, PointMatchingLoss=10.396450, MaskLoss=0.108871,
Epoch[3] Batch [280] Speed: 2.33 samples/sec Train-Flow_L2Loss=0.309070, Flow_CurLoss=0.000000, PointMatchingLoss=10.542913, MaskLoss=0.109692,
Epoch[3] Batch [300] Speed: 2.31 samples/sec Train-Flow_L2Loss=0.309031, Flow_CurLoss=0.000000, PointMatchingLoss=10.530530, MaskLoss=0.109173,
Epoch[3] Batch [320] Speed: 2.36 samples/sec Train-Flow_L2Loss=0.299514, Flow_CurLoss=0.000000, PointMatchingLoss=10.616660, MaskLoss=0.108448,
Epoch[3] Batch [340] Speed: 2.37 samples/sec Train-Flow_L2Loss=0.291039, Flow_CurLoss=0.000000, PointMatchingLoss=10.707121, MaskLoss=0.107355,
Epoch[3] Batch [360] Speed: 2.34 samples/sec Train-Flow_L2Loss=0.280393, Flow_CurLoss=0.000000, PointMatchingLoss=10.794026, MaskLoss=0.107118,
Epoch[3] Batch [380] Speed: 2.35 samples/sec Train-Flow_L2Loss=0.276359, Flow_CurLoss=0.202416, PointMatchingLoss=10.857506, MaskLoss=0.110217,
Epoch[3] Batch [400] Speed: 2.33 samples/sec Train-Flow_L2Loss=0.265881, Flow_CurLoss=0.000000, PointMatchingLoss=11.151155, MaskLoss=0.109521,
Epoch[3] Batch [420] Speed: 2.36 samples/sec Train-Flow_L2Loss=0.256921, Flow_CurLoss=0.000000, PointMatchingLoss=11.267041, MaskLoss=0.110485,
Epoch[3] Batch [440] Speed: 2.32 samples/sec Train-Flow_L2Loss=0.250065, Flow_CurLoss=0.000000, PointMatchingLoss=11.396601, MaskLoss=0.111172,
Epoch[3] Batch [460] Speed: 2.32 samples/sec Train-Flow_L2Loss=0.244040, Flow_CurLoss=0.000000, PointMatchingLoss=11.543343, MaskLoss=0.110957,
Epoch[3] Batch [480] Speed: 2.31 samples/sec Train-Flow_L2Loss=0.239315, Flow_CurLoss=0.000000, PointMatchingLoss=11.643859, MaskLoss=0.114532,
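To compare the trends without reading the numbers by eye, a log in this format can be parsed with a few lines of Python. This is only a rough sketch: `parse_log` and `net_change` are names made up for illustration, and the regular expressions assume every entry looks exactly like the ones pasted above.

```python
import re

# One entry: "Epoch[e] Batch [b] Speed: s samples/sec <metrics...>",
# assumed to run until the next "Epoch[" or the end of the text.
ENTRY_RE = re.compile(
    r"Epoch\[(\d+)\] Batch \[(\d+)\]\s+Speed: ([\d.]+) samples/sec\s+(.*?)(?=Epoch\[|$)",
    re.S,
)
# One metric: "Name=value", with an optional "Train-" prefix on the name.
METRIC_RE = re.compile(r"(?:Train-)?(\w+)=([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def parse_log(text):
    """Return one dict per log entry (hypothetical helper, not from the repo)."""
    records = []
    for m in ENTRY_RE.finditer(text):
        rec = {
            "epoch": int(m.group(1)),
            "batch": int(m.group(2)),
            "speed": float(m.group(3)),
        }
        for name, value in METRIC_RE.findall(m.group(4)):
            rec[name] = float(value)
        records.append(rec)
    return records


def net_change(records, metric):
    """Last value minus first value; a positive result means the metric rose."""
    values = [r[metric] for r in records if metric in r]
    return values[-1] - values[0]


# First and last entries copied from the log above.
sample = (
    "Epoch[3] Batch [260] Speed: 2.26 samples/sec Train-Flow_L2Loss=0.321852, "
    "Flow_CurLoss=0.000000, PointMatchingLoss=10.396450, MaskLoss=0.108871, "
    "Epoch[3] Batch [480] Speed: 2.31 samples/sec Train-Flow_L2Loss=0.239315, "
    "Flow_CurLoss=0.000000, PointMatchingLoss=11.643859, MaskLoss=0.114532,"
)
records = parse_log(sample)
for metric in ("Flow_L2Loss", "PointMatchingLoss", "MaskLoss"):
    print(metric, net_change(records, metric))
```

On these two entries the change is negative for Flow_L2Loss and positive for PointMatchingLoss, which is the divergence described in this issue.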

vshk42 · May 26 '19 07:05

The training finished for train_and_test_deepim_ape.sh. I'm getting an error during the test, which I'm debugging:

Traceback (most recent call last):
  File "experiments/deepim/deepim_train_test.py", line 22, in <module>
    test.main()
  File "experiments/deepim/../../deepim/test.py", line 210, in main
    test_deepim()
  File "experiments/deepim/../../deepim/test.py", line 203, in test_deepim
    pairdb=pairdb,
  File "experiments/deepim/../../deepim/core/tester.py", line 590, in pred_eval
    data_batch = update_data_batch(config, data_batch, update_package)
  File "experiments/deepim/../../lib/pair_matching/data_pair.py", line 79, in update_data_batch
    package = update_package[ctx_idx]
IndexError: list index out of range
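While debugging, a guard placed just before the failing lookup can make the mismatch visible. The sketch below is not code from the repository: `get_package` is a hypothetical helper, and the idea that `update_package` holds fewer entries than there are device contexts is an assumption that the traceback alone does not confirm.

```python
def get_package(update_package, ctx_idx):
    """Hypothetical helper: look up the per-context package with a clearer error."""
    if not 0 <= ctx_idx < len(update_package):
        # Assumption: the index comes from iterating over device contexts.
        raise IndexError(
            "ctx_idx={} but update_package has only {} entries; "
            "check that the number of test contexts matches the number "
            "of packages".format(ctx_idx, len(update_package))
        )
    return update_package[ctx_idx]
```

Calling `get_package(update_package, ctx_idx)` in place of the bare `update_package[ctx_idx]` would then report both sizes instead of only "list index out of range".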

I will update once the test runs. Attached are the PointMatchingLoss and learning rate plots. Do they match what you observe?

[image: PointMatchingLoss and learning rate plots]

vshk42 · May 26 '19 18:05

I'm seeing the same behavior. All other metrics decrease significantly, while the point matching loss increases at the same time.

huberl · Jun 11 '19 08:06