CLIP
Can a text-guided model focus on the features of a specific area in an image?
For example, if we input a facial image, can the text-guided network focus on the mouth area? Is this achievable?
Is the intention here to describe that area? In other words, do you want labels or a description for the mouth area?