teach The same action prediction gets different evaluation metrics in different runs

Hi,

I ran the baseline ET model and found that two runs produce significantly different evaluation metrics (this may be related to issue #10).

Run1:

SR: 77/608 = 0.127
GC: 487/3526 = 0.138
PLW SR: 0.026
PLW GC: 0.093

Run2:

SR: 52/608 = 0.086
GC: 321/3526 = 0.091
PLW SR: 0.007
PLW GC: 0.034

After taking a closer look at the output, I found that in some episodes the same sequence of predicted actions results in different evaluation metrics across runs. For example, for instance 66957a984ae5a714_f28d.edh4, the inference output of the first run is:

"66957a984ae5a714_f28d.edh4": {
        "instance_id": "66957a984ae5a714_f28d.edh4",
        "game_id": "66957a984ae5a714_f28d",
        "completed_goal_conditions": 2,
        "total_goal_conditions": 2,
        "goal_condition_success": 1,
        "success_spl": 0.55,
        "path_len_weighted_success_spl": 12.100000000000001,
        "goal_condition_spl": 0.55,
        "path_len_weighted_goal_condition_spl": 12.100000000000001,
        "gt_path_len": 22,
        "reward": 0,
        "success": 1,
        "traj_len": 40,
        "predicted_stop": 0,
        "num_api_fails": 30,
        "error": 0,
        "init_success": true,
        "pred_actions": [
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ]
        ]
    }

While for the second run it is:

"66957a984ae5a714_f28d.edh4": {
        "instance_id": "66957a984ae5a714_f28d.edh4",
        "game_id": "66957a984ae5a714_f28d",
        "completed_goal_conditions": 0,
        "total_goal_conditions": 2,
        "goal_condition_success": 0.0,
        "success_spl": 0.0,
        "path_len_weighted_success_spl": 0.0,
        "goal_condition_spl": 0.0,
        "path_len_weighted_goal_condition_spl": 0.0,
        "gt_path_len": 22,
        "reward": 0.0,
        "success": 0,
        "traj_len": 40,
        "predicted_stop": 0,
        "num_api_fails": 30,
        "error": 0,
        "init_success": true,
        "pred_actions": [
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ]
        ]
    }

So the first evaluation result does not make sense: the model predicted only Forward actions, and it should have no chance of succeeding without performing any manipulation actions.
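To find every affected episode rather than a single example, the two inference outputs can be compared programmatically. The following is a minimal sketch, not part of the TEACh codebase: it assumes each run's output is a single JSON object keyed by instance ID with the fields shown above, and the function names `find_inconsistent_instances` and `load_run` are made up for illustration.

```python
import json

# Metric fields compared between runs; the names are taken from the output above.
METRIC_KEYS = ("success", "completed_goal_conditions", "goal_condition_success")


def find_inconsistent_instances(run1, run2, keys=METRIC_KEYS):
    """Return IDs whose pred_actions match across runs but whose metrics differ."""
    inconsistent = []
    for instance_id in sorted(set(run1) & set(run2)):
        a, b = run1[instance_id], run2[instance_id]
        if a.get("pred_actions") != b.get("pred_actions"):
            continue  # different predictions, so different metrics are expected
        if any(a.get(k) != b.get(k) for k in keys):
            inconsistent.append(instance_id)
    return inconsistent


def load_run(path):
    # Hypothetical helper: assumes the file holds one JSON object keyed by instance ID.
    with open(path) as f:
        return json.load(f)
```

Calling `find_inconsistent_instances(load_run(path_1), load_run(path_2))` with the two output files would list the instances where identical action sequences were scored differently.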

The first run was done on an AWS EC2 p3.8 instance and the second on a p3.16. All other settings were the same. The full evaluation logs are available here: [run 1] [run 2]

Do you have any idea about the cause? Thanks

Jun 14 '22 16:06 594zyc

Hi, can you share the inference and pred_actions files generated for that particular EDH instance in the two runs? I need those to debug issues with the metrics calculation.

I can understand why you think the bug is in the metrics, but so that I have enough information to reproduce this, can you tell me whether you trained a new ET model or used one of the released checkpoints?

Also, could you share the exact commands used to run inference in the two cases? Did anything change (for example, the random seed)? I realize randomness is probably not the issue here, but if the bug is elsewhere and results in a corrupted agent state, I would need to be able to replicate exactly how the agent state you got in both cases was created.

Jun 17 '22 17:06 aishwaryap

Hi,

Sorry for the late response! The inference and prediction files are available here.

Inference was run with your released ET model using the following command:

teach_inference \
    --data_dir $DATA_DIR \
    --output_dir /home/ubuntu/teach-eval/et/predictions \
    --metrics_file /home/ubuntu/teach-eval/et/metrics/metrics_seen.txt \
    --images_dir $IMAGE_DIR \
    --split valid_seen \
    --model_module teach.inference.et_model \
    --model_class ETModel \
    --model_dir ./models/baseline_models/et \
    --visual_checkpoint ./models/et_pretrained_models/fasterrcnn_model.pth \
    --object_predictor ./models/et_pretrained_models/maskrcnn_model.pth \
    --seed 4 \
    --num_processes 16

Jun 29 '22 20:06 594zyc

Hi @594zyc, apologies for the delay in responding. After taking a look at your inference files, I think this behaviour is likely caused by a bug we also found internally, where some object properties are not properly reset between episodes. We fixed this in commit 974b3f1013e1cacc4b21d6eb65e84a1d33f82c18, so if you pull the latest mainline, you should hopefully no longer see this issue. I would appreciate it if you could post an update here after testing, so that I can decide whether this issue has been fully resolved.
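One way to check a fresh evaluation for the symptom reported in this thread is to flag episodes that are marked successful even though no predicted action targeted an object. This is only a sketch under stated assumptions: it assumes each `pred_actions` entry is a pair of action name and target, with the target being null for navigation actions as in the output posted above, and `suspicious_successes` is an illustrative name, not a TEACh function.

```python
def suspicious_successes(results):
    """Return IDs of instances marked successful without any object-targeted action."""
    flagged = []
    for instance_id, entry in sorted(results.items()):
        actions = entry.get("pred_actions", [])
        # Assumption: the second element of each entry is None (JSON null)
        # for navigation actions and non-None when an object is targeted.
        interacted = any(len(a) > 1 and a[1] is not None for a in actions)
        if entry.get("success") == 1 and not interacted:
            flagged.append(instance_id)
    return flagged
```

An empty result on the re-run would be consistent with the reset bug being fixed, although it does not prove that every metric is now correct.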

Jul 05 '22 21:07 aishwaryap