Much higher scores when evaluating Episodic Transformer baselines for EDH instances
Hello,
I have finished evaluating the Episodic Transformer baselines for the TEACh Benchmark Challenge on the valid_seen split.
One odd thing I found, however, is that our reproduced results are much higher than those reported in the paper. The results are shown below (all values are percentages). There are 608 EDH instances (valid_seen) in the metrics file, which matches the number in the paper.
| | SR [TLW] | GC [TLW] |
|---|---|---|
| Reproduced | 13.8 [3.2] | 14 [8.7] |
| Reported in the paper | 5.76 [0.90] | 7.99 [1.65] |
I believe I am using the correct checkpoints, and the only change I made to the code is the one mentioned in #9.
I am running on an AWS instance. I started the X server and installed all requirements and prerequisites without problems, and the inference process runs without errors.
Here is the script I used for evaluation.
```sh
#!/bin/sh
export AWS_ROOT=/home/ubuntu/workplace
export ET_DATA=$AWS_ROOT/data
export TEACH_ROOT_DIR=$AWS_ROOT/teach
export TEACH_SRC_DIR=$TEACH_ROOT_DIR/src
export ET_ROOT=$TEACH_SRC_DIR/teach/modeling/ET
export ET_LOGS=$TEACH_ROOT_DIR/src/teach/modeling/ET/checkpoints
export INFERENCE_OUTPUT_PATH=$TEACH_ROOT_DIR/inference_output
export PYTHONPATH=$TEACH_SRC_DIR:$ET_ROOT:$PYTHONPATH
export SPLIT=valid_seen

cd $TEACH_ROOT_DIR
python src/teach/cli/inference.py \
    --model_module teach.inference.et_model \
    --model_class ETModel \
    --data_dir $ET_DATA \
    --output_dir $INFERENCE_OUTPUT_PATH/inference__teach_et_trial_$SPLIT \
    --split $SPLIT \
    --metrics_file $INFERENCE_OUTPUT_PATH/metrics__teach_et_trial_$SPLIT.json \
    --seed 4 \
    --model_dir $ET_DATA/baseline_models/et \
    --object_predictor $ET_LOGS/pretrained/maskrcnn_model.pth \
    --visual_checkpoint $ET_LOGS/pretrained/fasterrcnn_model.pth \
    --device "cpu" \
    --images_dir $INFERENCE_OUTPUT_PATH/images
```
I wonder whether the data split provided in the released dataset is the same as the one used in the paper, and if so, what could explain this gap.
Please let me know if anyone else is getting similar results. Thank you!
Hi @yingShen-ys
That sounds like a reasonable result. I will leave the issue open, however, so that we can see whether others are able to reproduce it.
While the dataset split itself has not changed, we have made some improvements to the inference code which have resulted in higher scores. If you want to run inference the way it was done in the paper, add the argument `--skip_edh_history` to your inference command. You can see what this argument does by checking `ETModel.start_new_edh_instance()`.
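For example, reusing the command from the evaluation script above, a paper-style run would simply append the flag (everything else unchanged):

```sh
python src/teach/cli/inference.py \
    --model_module teach.inference.et_model \
    --model_class ETModel \
    --data_dir $ET_DATA \
    --output_dir $INFERENCE_OUTPUT_PATH/inference__teach_et_trial_$SPLIT \
    --split $SPLIT \
    --metrics_file $INFERENCE_OUTPUT_PATH/metrics__teach_et_trial_$SPLIT.json \
    --seed 4 \
    --model_dir $ET_DATA/baseline_models/et \
    --object_predictor $ET_LOGS/pretrained/maskrcnn_model.pth \
    --visual_checkpoint $ET_LOGS/pretrained/fasterrcnn_model.pth \
    --device "cpu" \
    --images_dir $INFERENCE_OUTPUT_PATH/images \
    --skip_edh_history
```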
Best, Aishwarya
Got it. Thank you for the clarification.
We got a similar result, and in fact the results can differ significantly when training on different machines.
I believe the differences are less likely to be due to the machine used for training and more likely an effect of the random seed. We have also seen this behavior when training the ET model on ALFRED.
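To get a rough sense of how much run-to-run variance the evaluation itself contributes (this only exercises inference-time randomness; the training seed is fixed once the checkpoints are produced), one could repeat the valid_seen evaluation over a few seeds. A minimal sketch reusing the paths from the script above, with illustrative output names:

```sh
# Repeat the valid_seen evaluation with several inference seeds to gauge run-to-run variance
for SEED in 1 2 3; do
    python src/teach/cli/inference.py \
        --model_module teach.inference.et_model \
        --model_class ETModel \
        --data_dir $ET_DATA \
        --output_dir $INFERENCE_OUTPUT_PATH/inference__teach_et_seed${SEED}_$SPLIT \
        --split $SPLIT \
        --metrics_file $INFERENCE_OUTPUT_PATH/metrics__teach_et_seed${SEED}_$SPLIT.json \
        --seed $SEED \
        --model_dir $ET_DATA/baseline_models/et \
        --object_predictor $ET_LOGS/pretrained/maskrcnn_model.pth \
        --visual_checkpoint $ET_LOGS/pretrained/fasterrcnn_model.pth \
        --device "cpu" \
        --images_dir $INFERENCE_OUTPUT_PATH/images
done
```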