Issues Reproducing the QA Task Results
Hello there! I'm interested in your work, but I ran into some discrepancies when reproducing the results from the paper, so I'd like to ask a few questions.
- In the QA task, is the 74.80 result tested by bash eval.generalist.sh obtained from ll3da-generalist/checkpoint.pth rather than ll3da-generalist/checkpoint_best.pth?
- Similarly, is the fine-tuned 76.79 result evaluated with eval.scanqa.sh on ll3da-scanqa-tuned/checkpoint.pth or on ll3da-scanqa-tuned/checkpoint_best.pth? (To double-check on my side, I compare the two checkpoint files with the sketch after this list.)
- Another question concerns the visual prompts in Table 4 and Table 8. What is the difference between them, and why does Table 4 only reach 74.80 after adding visual prompts? Does the fine-tuning phase also use both the text and visual prompts? My fine-tuned result is only 76.69, far from the 82.91 in Table 8. Which part of the code corresponds to Table 8? So far I have only found the click-related part in unified_scanqa.py.
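Just for context, this is the rough sanity check I run to see whether the two checkpoint files actually contain different weights. It is not code from this repo, and the state-dict key names are only guesses:

```python
# Rough sanity check (not code from this repo): compare the two checkpoint files to
# see whether they actually hold different weights. The state-dict key names below
# ("model", "state_dict", ...) are guesses.
import torch

ckpt_a = torch.load("ll3da-generalist/checkpoint.pth", map_location="cpu")
ckpt_b = torch.load("ll3da-generalist/checkpoint_best.pth", map_location="cpu")

def get_state_dict(ckpt):
    """Pull the parameter dict out of a checkpoint, guessing common key names."""
    for key in ("model", "state_dict", "model_state_dict"):
        if isinstance(ckpt, dict) and key in ckpt:
            return ckpt[key]
    return ckpt  # the file may already be a bare state dict

sd_a, sd_b = get_state_dict(ckpt_a), get_state_dict(ckpt_b)
same = all(
    torch.equal(sd_a[k], sd_b[k])
    for k in sd_a
    if k in sd_b and isinstance(sd_a[k], torch.Tensor) and isinstance(sd_b[k], torch.Tensor)
)
print("identical parameters:", same)
```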
- We always evaluate our method with checkpoint_best.pth.
- The evaluation is similar to 1. However, there is a slight difference between our released codebase and the main paper: the reported results are trained with all the 3D-LLM data, regardless of duplications, whereas we drop the duplicates in our released codebase.
- Table 8 shows the effectiveness of "test-time" visual prompts, while the other tables evaluate the model with text-only interactions.
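To make the duplication point concrete, here is a minimal sketch of the kind of de-duplication the released codebase applies. This is not the actual cleansing script, and the field names (scene_id, question, answers) are assumptions rather than the exact annotation schema:

```python
# Sketch only: the kind of de-duplication applied to the ScanNet part of the
# 3D-LLM annotations in the released codebase. Field names are assumptions,
# and this is not the actual cleansing script.
import json

def drop_duplicates(annotation_file: str) -> list:
    with open(annotation_file, "r") as f:
        samples = json.load(f)
    seen, unique_samples = set(), []
    for sample in samples:
        # Treat samples with identical scene / question / answer text as duplicates.
        key = (
            sample.get("scene_id"),
            sample.get("question"),
            json.dumps(sample.get("answers"), sort_keys=True),
        )
        if key not in seen:
            seen.add(key)
            unique_samples.append(sample)
    return unique_samples
```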
Thank you very much for your response! But I still have some questions:
For answer two, does "the reported results are trained with all the 3D-LLM data" mean that when I run bash scripts/opt-1.3b/train.generalist.sh, I only need to use the unified_3dllm_scene_description, unified_3dllm_embodied_dialogue, and unified_3dllm_embodied_planning datasets, and that the remaining datasets are only used during fine-tuning?

Regarding answer three, how are the "test-time" visual prompts implemented in the code? Are the visual prompts operations like click and _encode_box_coords in the unified_scanqa.py file? How can I easily control whether visual prompts or text prompts are used during testing? Something like the sketch below is what I have in mind.
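For reference, this is only my own pseudocode for the kind of switch I have in mind, not code from the repo; use_visual_prompt and click_query are hypothetical names:

```python
# Hypothetical switch (not an existing option in the repo): keep the text
# instruction in every case, and only attach the 3D click prompt when a flag is set.
from typing import Optional

import numpy as np

def build_prompt(question: str,
                 click_xyz: Optional[np.ndarray] = None,
                 use_visual_prompt: bool = False) -> dict:
    prompt = {"instruction": question}        # the text instruction is always required
    if use_visual_prompt and click_xyz is not None:
        prompt["click_query"] = click_xyz     # optional 3D point prompt, shape (3,)
    return prompt

# text-only evaluation
print(build_prompt("what color is the chair?"))
# evaluation with a test-time click prompt
print(build_prompt("what color is the chair?", np.array([1.2, 0.4, 0.7]), use_visual_prompt=True))
```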
A few more comments:
Q2 - No, "all the 3D-LLM data" refers to using the entire ScanNet portion of 3D-LLM before data cleansing, which might contain duplicated training samples. We have not released this copy of the data.
Q3 - For the quantitative results in row 2 of Table 8, we naively use all the object-id annotations for both training and evaluation, since the original annotations select more objects than are actually related to the question. We have not released that code either. Indeed, the text instructions are required, while the visual prompts are optional and only adopted in tasks like ScanQA, 3D dense captioning, and 3D open-vocabulary detection.
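As a rough illustration of that row-2 setting (again, the exact script is unreleased): look up the ground-truth boxes of the annotated object ids and attach them as optional box prompts alongside the text instruction. The box formatting below is only a sketch, not the actual _encode_box_coords implementation:

```python
# Illustration of the Table 8 row-2 setting (the real script is not released):
# take the object ids annotated for a question, look up their ground-truth boxes,
# and attach them as optional box prompts next to the text instruction.
import numpy as np

def encode_box_coords(boxes: np.ndarray) -> str:
    """Format (N, 6) boxes given as (cx, cy, cz, dx, dy, dz) into a short text string.
    This formatting is illustrative, not the exact _encode_box_coords implementation."""
    tokens = []
    for cx, cy, cz, dx, dy, dz in boxes:
        tokens.append(f"<obj>{cx:.2f},{cy:.2f},{cz:.2f},{dx:.2f},{dy:.2f},{dz:.2f}</obj>")
    return " ".join(tokens)

def build_sample(question: str, object_ids: list, scene_boxes: dict,
                 use_visual_prompt: bool = True) -> dict:
    sample = {"instruction": question}                      # text is always required
    if use_visual_prompt and object_ids:
        boxes = np.stack([scene_boxes[obj_id] for obj_id in object_ids])
        sample["box_prompt"] = boxes                        # optional visual prompt
        sample["instruction"] = encode_box_coords(boxes) + " " + question
    return sample

# toy example: one annotated object with a (cx, cy, cz, dx, dy, dz) box
scene_boxes = {3: np.array([1.0, 2.0, 0.5, 0.6, 0.6, 1.2])}
print(build_sample("what is next to the table?", [3], scene_boxes))
```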
OK, thank you for your answer 😊