Simplify image description in `visual.py` and `segment.py`
Feature request
In `VisualReplayStrategy` and `SegmentReplayStrategy`, the segment description pipeline is currently:
image -> masks -> masked_images -> masked_image_descriptions = prompt("describe these images") -> active_segment_description (for mouse events only) -> prompt("given <masked_image_descriptions>,<active_segment_description>, ...: generate the next action") -> modified_active_segment_description -> modified_segment_coordinates
A simpler version worth trying:
image -> masks -> masked_images -> masked_image_descriptions = prompt("describe these images") -> active_segment_description (for mouse events only) -> prompt("given <masked_image_descriptions>,<active_segment_description>, and their coordinates,...: generate the next action")
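i.e. have the model return coordinates directly, given the segment descriptions and their coordinates, instead of returning a modified description that must then be mapped back to coordinates.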
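A minimal sketch of what the simpler version could look like, assuming each segment is represented by its description plus a center point. The names `Segment`, `build_action_prompt`, and `parse_action`, as well as the JSON reply format, are hypothetical and do not exist in `visual.py` or `segment.py`; the actual prompt wording and action schema would need to match the existing strategies.

```python
import json
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical container: one masked image's description and center."""

    description: str
    center_x: float
    center_y: float


def build_action_prompt(segments, active_segment_description=None):
    """Build a prompt listing every segment description with its coordinates."""
    lines = ["Segments visible in the current screenshot:"]
    for index, segment in enumerate(segments):
        lines.append(
            f"{index}. {segment.description} "
            f"(center: x={segment.center_x}, y={segment.center_y})"
        )
    # The active segment description only exists for mouse events.
    if active_segment_description is not None:
        lines.append(
            f"Active segment in the recording: {active_segment_description}"
        )
    # Assumed reply format: a single JSON object containing coordinates.
    lines.append(
        'Generate the next action as JSON, e.g. {"name": "click", "x": 0, "y": 0}.'
    )
    return "\n".join(lines)


def parse_action(completion):
    """Read the action name and coordinates straight from the model's reply."""
    action = json.loads(completion)
    return action["name"], float(action["x"]), float(action["y"])
```

With this shape, the coordinates come directly from the completion, so no second lookup from a modified segment description is required.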
Motivation
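Simplify the pipeline by removing the step that maps a modified segment description back to coordinates, and benefit directly from future improvements in model performance.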