Thermal x Vision Support
Following issue 14, I created a small example for thermal embeddings. While Vision x Text and Thermal x Text work as expected, Vision x Thermal does not yield the correct result.
```python
def load_and_transform_thermal_data(thermal_paths, device):
    if image_paths is None:
        return None
    thermal_ouputs = []
    for thermal_path in thermal_paths:
        data_transform = transforms.Compose(
            [
                transforms.Resize(
                    224, interpolation=transforms.InterpolationMode.BICUBIC
                ),
                transforms.CenterCrop(224),
                transforms.ToTensor(),
                # transforms.Normalize(
                #     mean=(0.5),
                #     std=(0.5),
                # ),
            ]
        )
        with open(thermal_path, "rb") as fopen:
            thermal = Image.open(fopen).convert("L")
            thermal = data_transform(thermal).to(device)
        thermal_ouputs.append(thermal)
    return torch.stack(thermal_ouputs, dim=0)
```
And the results are:

Vision x Text:

```
[[9.9997604e-01 2.3943641e-05]
 [6.0792509e-06 9.9999392e-01]]
```

Thermal x Text:

```
[[1.0000000e+00 1.2433221e-11]
 [2.8220674e-02 9.7177935e-01]]
```

Vision x Thermal Cosine:

```
[[0.1554441  0.02945926]
 [0.16725276 0.03671783]]
```

Vision x Thermal Softmax:

```
[[0.7789999  0.22100005]
 [0.7867338  0.21326624]]
```
What dataset did you use for your thermal data? Did you use LLVIP in the paper?
Done?
Where exactly did you add the function `load_and_transform_thermal_data`? I am facing a different issue, though this might help. My error is:

```
Given groups=1, weight of size [768, 1, 16, 16], expected input[3, 3, 224, 224] to have 1 channels, but got 3 channels instead
```
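That error usually means the thermal branch's patch-embedding conv expects a single input channel while the images were loaded as RGB; loading with `.convert("L")` as in the snippet above avoids it. A minimal reproduction with a stand-in conv (assumes PyTorch; this conv is not the actual model layer, just one with the same weight shape):

```python
import torch
import torch.nn as nn

# Stand-in for the thermal patch embedding: weight shape [768, 1, 16, 16]
patch_embed = nn.Conv2d(1, 768, kernel_size=16, stride=16)

rgb_batch = torch.randn(3, 3, 224, 224)   # images loaded with .convert("RGB")
try:
    patch_embed(rgb_batch)
except RuntimeError as e:
    print(e)                              # the "expected input ... to have 1 channels" error

gray_batch = torch.randn(3, 1, 224, 224)  # images loaded with .convert("L")
print(patch_embed(gray_batch).shape)      # torch.Size([3, 768, 14, 14])
```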
Thanks in advance!
Also, I think there is a typo on line 2: replace `image_paths` with `thermal_paths`.
Hi, I'd like to recommend our work, LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. We open-source all training and validation code.
LanguageBind can be disassembled into different branches to handle different tasks.
```python
print("Video x Audio: \n", torch.softmax(embeddings['video'] @ embeddings['audio'].T, dim=-1).detach().cpu().numpy())
print("Video x Thermal: \n", torch.softmax(embeddings['video'] @ embeddings['thermal'].T, dim=-1).detach().cpu().numpy())
print("Image x Thermal: \n", torch.softmax(embeddings['image'] @ embeddings['thermal'].T, dim=-1).detach().cpu().numpy())
print("Image x Depth: \n", torch.softmax(embeddings['image'] @ embeddings['depth'].T, dim=-1).detach().cpu().numpy())
```
@LinB203 I have tried your work, but running inference.py multiple times produces inconsistent outputs each time, so I suspect there is an error somewhere. Please verify this issue.
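Whether this explains the inconsistent inference.py outputs is only a guess, but the most common cause is a module (e.g. dropout) left in train mode or unseeded randomness. A minimal sketch of the usual determinism checklist (the model here is a toy stand-in):

```python
import torch

torch.manual_seed(0)              # seed any remaining random ops
model = torch.nn.Sequential(
    torch.nn.Linear(8, 64),
    torch.nn.Dropout(p=0.5),      # stand-in for dropout inside a real model
)

x = torch.ones(1, 8)

model.train()                     # default state after construction
a, b = model(x), model(x)
print(torch.equal(a, b))          # False: dropout re-samples a mask on every call

model.eval()                      # turn off dropout for inference
c, d = model(x), model(x)
print(torch.equal(c, d))          # True: repeated runs now match
```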