visual_prompting
Input Grids and Support pairs
Hello. I have a clarification question regarding input grids and support pairs.
It looks like the model always works with a 224×224 input image, which is tokenized into a 14×14 grid of patches. If we want to include more support pairs (more rows), it seems we have to fit them within the same 14×14 patches, which means there is a tradeoff between the number of support pairs and the resolution of each image. Is that correct? If so, I have a question about the figure in the paper showing that more examples give better results. Were the 5 support pairs in the grid shown in that figure fit within these 14×14 patches (at lower resolution per image), and did the model still produce better results?
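To make the arithmetic behind my question concrete, here is a rough sketch of the per-image budget as rows are added. It assumes a layout that may not match the repository's actual grid code: each row holds one pair (2 columns), the canvas is split evenly with no padding or borders, and leftover patches are simply dropped. The names `CANVAS`, `GRID`, and `cell_budget` are hypothetical and only for illustration.

```python
CANVAS = 224            # canvas side in pixels (from the question)
GRID = 14               # patches per side (from the question)
PATCH = CANVAS // GRID  # pixels per patch side, 224 / 14 = 16


def cell_budget(rows, cols=2):
    """Return (patches_high, patches_wide, pixels_high, pixels_wide) per cell.

    Assumes an even split of the 14x14 patch grid into rows x cols cells,
    rounding down; this is an illustrative assumption, not the repo's layout.
    """
    patches_high = GRID // rows
    patches_wide = GRID // cols
    return patches_high, patches_wide, patches_high * PATCH, patches_wide * PATCH


# 2 rows = 1 support pair + query row; 6 rows = 5 support pairs + query row.
for rows in (2, 3, 6):
    print(rows, cell_budget(rows))
```

Under these assumptions, going from 2 rows to 6 rows shrinks each image from 7 patches (112 px) tall to 2 patches (32 px) tall, which is the resolution loss I am asking about.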
Thanks!