datasets
Why does this dataset have both vision and language datasets, while the vision datasets have no language annotations? I am curious which dataset the benchmark results are trained and tested on.
The language dataset is the 1% of the vision dataset that has been labeled with language instructions. In Multi-Context Imitation Learning, the agent is trained with different goal modalities: either a goal image or a goal language instruction. Training is performed on both datasets (which is why we use the combined dataloader), while evaluation uses exclusively language instructions (i.e., language goals). It seems to me that you still haven't read the original papers that I linked in your original issue in the hulc repo.
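To make the batch composition concrete, here is a minimal sketch of how a combined dataloader could mix the two goal modalities during training. All names (`make_combined_batches`, the episode/instruction placeholders) and the example split sizes are hypothetical illustrations of the idea described above, not the actual hulc/calvin implementation.

```python
import random

def make_combined_batches(vision_episodes, lang_episodes, batch_size, seed=0):
    """Yield shuffled mixed batches. Each sample carries either a goal
    image (vision-only data) or a language instruction (the small
    annotated subset), so the policy learns both goal modalities."""
    rng = random.Random(seed)
    samples = (
        # goal is the final frame of the episode (hindsight goal image)
        [{"obs": ep, "goal": ("image", ep)} for ep in vision_episodes]
        # goal is the human-written language instruction
        + [{"obs": ep, "goal": ("lang", instr)} for ep, instr in lang_episodes]
    )
    rng.shuffle(samples)
    for i in range(0, len(samples), batch_size):
        yield samples[i : i + batch_size]

# Toy data: most episodes are vision-only, a small fraction is annotated.
vision = [f"ep{i}" for i in range(8)]
lang = [(f"ep{i}", f"instruction {i}") for i in range(2)]
batches = list(make_combined_batches(vision, lang, batch_size=5))
```

Training iterates over these mixed batches; at evaluation time one would feed only `("lang", ...)` goals, matching the language-conditioned benchmark setup.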