Results change during testing / evaluation
Hi authors @quantaji, thanks for providing the code. I used the code from https://huggingface.co/labelmaker/PTv3-ARKit-LabelMaker/tree/main/scannet200/insseg-pointgroup-v1m1-pt-v3m1-ppt-ft to fine-tune PTv3, but I found that the results change between runs during testing / evaluation.
Here is my config:
from PTv3.code.pointcept.models.point_prompt_training.point_prompt_training_v1m2_decoupled import PointPromptTraining
from PTv3.code.pointcept.models.point_transformer_v3.point_transformer_v3m1_base import PointTransformerV3
backbone = dict(
    in_channels=15,  # originally this is 6, but I need to use a different type of input so input dim is now 15
    order=('z', 'z-trans', 'hilbert', 'hilbert-trans'),
    stride=(2, 2, 2, 2),
    enc_depths=(3, 3, 3, 6, 3),
    enc_channels=(48, 96, 192, 384, 512),
    enc_num_head=(3, 6, 12, 24, 32),
    enc_patch_size=(1024, 1024, 1024, 1024, 1024),
    dec_depths=(3, 3, 3, 3),
    dec_channels=(64, 96, 192, 384),
    dec_num_head=(4, 6, 12, 24),
    dec_patch_size=(1024, 1024, 1024, 1024),
    mlp_ratio=4,
    qkv_bias=True,
    qk_scale=None,
    attn_drop=0.0,
    proj_drop=0.0,
    drop_path=0.3,
    shuffle_orders=True,
    pre_norm=True,
    enable_rpe=False,
    enable_flash=True,
    upcast_attention=False,
    upcast_softmax=False,
    cls_mode=False,
    pdnorm_bn=True,
    pdnorm_ln=True,
    pdnorm_decouple=True,
    pdnorm_adaptive=False,
    pdnorm_affine=True,
    pdnorm_conditions=('ScanNet', 'ScanNet200', 'ScanNet++', 'Structured3D', 'ALC'),
)
backbone = PointTransformerV3(**backbone)

model = dict(
    backbone=backbone,
    context_channels=256,
    conditions=('ScanNet', 'ScanNet200', 'ScanNet++', 'Structured3D', 'ALC'),
    num_classes=(20, 200, 100, 25, 185),
    backbone_mode=True,  # I only use this to extract features
)
self.segmentor = PointPromptTraining(**model)
I then obtain the results with:
ptv3_output = self.segmentor(data_dict) # [total_num_points, 64]
I changed PointPromptTraining to act as a feature extractor only (which corresponds to setting backbone_mode=True). I made sure that self.training is False for every module and that requires_grad is False for every parameter, and I verified that data_dict is not modified between runs. Still, running the code above several times gives completely different results, far beyond anything floating-point precision could explain.
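For reference, here is roughly how I run the comparison. This is a minimal sketch, assuming the segmentor output is a plain [total_num_points, 64] feature tensor and that data_dict contains only tensors; check_determinism is my own helper, not part of Pointcept:

import copy

import torch


def check_determinism(segmentor, data_dict, seed=0):
    """Run the same input through the model twice and compare the outputs."""
    segmentor.eval()                        # sets self.training = False on every module
    for p in segmentor.parameters():
        p.requires_grad_(False)

    outputs = []
    for _ in range(2):
        torch.manual_seed(seed)             # reseed CPU and CUDA RNGs before each pass
        with torch.no_grad():
            outputs.append(segmentor(copy.deepcopy(data_dict)))

    # With a deterministic model this should be True; in my runs it is False.
    return torch.allclose(outputs[0], outputs[1])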
I am wondering what I am missing here. How do we obtain deterministic results during testing / evaluation?
Many thanks!
I remember that the pre-processing code in Pointcept uses some form of sampling as data augmentation, which is why the evaluation score during training is noisy. Please check the data config you use in detail; I hope this is the cause. Otherwise, I also have no idea.
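For illustration, a Pointcept-style data config looks roughly like the sketch below (the dataset type, transform names, and values are examples only, not the actual LabelMaker config). If the pipeline used at evaluation time still contains random transforms such as RandomRotate, RandomFlip, or SphereCrop with mode="random", the points fed to the model differ between runs and the scores fluctuate:

data = dict(
    train=dict(
        type="ScanNet200Dataset",
        transform=[
            dict(type="CenterShift", apply_z=True),
            dict(type="RandomRotate", angle=[-1, 1], axis="z", p=0.5),  # random
            dict(type="RandomFlip", p=0.5),                             # random
            dict(type="SphereCrop", point_max=102400, mode="random"),   # random subsampling
            dict(type="ToTensor"),
        ],
    ),
    val=dict(
        type="ScanNet200Dataset",
        transform=[
            dict(type="CenterShift", apply_z=True),  # deterministic ops only
            dict(type="ToTensor"),
        ],
    ),
)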
Hi @quantaji, thanks for your response. So you mean that the model itself should not have any randomness during testing, correct? In the example above, I made sure data_dict is identical across all runs but still get different results, so this is unlikely to be a data-augmentation issue. Still, thank you very much!
I am sorry, but I cannot help you further with this, as the PTv3 and PPT code was contributed by the original authors.