
`log_image`: log bounding boxes

dberenbaum opened this issue 2 years ago • 5 comments

Related: https://github.com/iterative/dvc/issues/10198, https://github.com/iterative/vscode-dvc/issues/4917

We need a way to log bounding boxes (and maybe later other annotations like segmentation masks) for images saved with dvclive.

p1

The API can look like this:

boxes = [
  {"label": "cat", "box": {"x_min": 100, "x_max": 110, "y_min": 5, "y_max": 20}},
  {"label": "cat", "box": {"x_min": 30, "x_max": 55, "y_min": 75, "y_max": 90}},
  {"label": "dog", "box": {"x_min": 80, "x_max": 100, "y_min": 25, "y_max": 50}}
]
live.log_image("myimg.png", myimg, boxes=boxes)

In addition to saving the image to dvclive/plots/images/myimg.png, this will also save annotations to dvclive/plots/images/myimg.json in the following format:

{"boxes":
  [
    {"label": "cat", "box": {"x_min": 100, "x_max": 110, "y_min": 5, "y_max": 20}},
    {"label": "cat", "box": {"x_min": 30, "x_max": 55, "y_min": 75, "y_max": 90}},
    {"label": "dog", "box": {"x_min": 80, "x_max": 100, "y_min": 25, "y_max": 50}}
  ]
}
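A minimal sketch of how the sidecar file could be written, assuming the directory layout above (the helper name `write_box_annotations` is hypothetical, not part of the dvclive API):

```python
import json
from pathlib import Path

def write_box_annotations(image_name, boxes, out_dir="dvclive/plots/images"):
    """Write box annotations for an image to a sibling .json file
    (hypothetical helper sketching the proposed behavior)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    json_path = out_dir / (Path(image_name).stem + ".json")
    json_path.write_text(json.dumps({"boxes": boxes}, indent=2))
    return json_path

boxes = [
    {"label": "cat", "box": {"x_min": 100, "x_max": 110, "y_min": 5, "y_max": 20}},
    {"label": "dog", "box": {"x_min": 80, "x_max": 100, "y_min": 25, "y_max": 50}},
]
path = write_box_annotations("myimg.png", boxes)
# path -> dvclive/plots/images/myimg.json
```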

p2:

  • [ ] Other box formats (using width, height, and x/y for the center/corner) ({"x_center": 100, "y_center": 50, "width": 10, "height": 20})
  • [ ] Normalized coordinates (between 0 and 1) instead of pixel coordinates (we could probably auto-detect this)
  • [ ] Scores ("scores": {"acc": 0.9, "loss": 0.05}) so that users can filter boxes based on thresholds (only show boxes where acc > 0.8)
  • [ ] Segmentation masks (TBD; requires a class per pixel)
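The first two P2 items (center/size boxes, normalized coordinates) could be normalized to the P1 min/max pixel format; a sketch, where the function name and the auto-detection heuristic (all values ≤ 1.0) are assumptions:

```python
def to_min_max(box, image_width=None, image_height=None):
    """Convert a center/size box to min/max pixel coordinates
    (hypothetical helper). If the coordinates look normalized (all
    values <= 1.0) and image dimensions are given, scale to pixels first."""
    x_c, y_c = box["x_center"], box["y_center"]
    w, h = box["width"], box["height"]
    if max(x_c, y_c, w, h) <= 1.0 and image_width and image_height:
        x_c, w = x_c * image_width, w * image_width
        y_c, h = y_c * image_height, h * image_height
    return {
        "x_min": x_c - w / 2, "x_max": x_c + w / 2,
        "y_min": y_c - h / 2, "y_max": y_c + h / 2,
    }

to_min_max({"x_center": 100, "y_center": 50, "width": 10, "height": 20})
# -> {"x_min": 95.0, "x_max": 105.0, "y_min": 40.0, "y_max": 60.0}
```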

dberenbaum avatar Jan 23 '24 22:01 dberenbaum

May need to consider whether it's necessary to list the universe of labels somewhere or if it's fine to parse them as the set of all individual labels.
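If we go the parsing route, the label universe could be derived from the annotation payloads themselves; a sketch (helper name is hypothetical):

```python
import json

def collect_labels(annotation_texts):
    """Derive the set of all labels seen across annotation JSON payloads."""
    labels = set()
    for text in annotation_texts:
        for ann in json.loads(text).get("boxes", []):
            labels.add(ann["label"])
    return labels

payload = '{"boxes": [{"label": "cat", "box": {}}, {"label": "dog", "box": {}}]}'
collect_labels([payload])  # -> {"cat", "dog"}
```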

dberenbaum avatar Jan 23 '24 22:01 dberenbaum

I'm not 100% sure this is the right place for my first feature discussion, but I'll jump in, and I'll remove my comment if this isn't the correct way of handling feature discussions internally.

  • I would advise using "left," "top," "right," and "bottom" instead of "x" and "y" notation. The first leaves no ambiguity, while the second is quite ambiguous. First, it depends on what you consider x and y to be: images can be seen as a matrix (x is vertical, y is horizontal) or as a plot (x is horizontal, y is vertical). Then, "x_min" and "x_max" depend on your reference point. For instance, torchvision and Shapely don't use the same one: the first considers the top left of the image to be the origin, and the second the bottom left (the same matrix-vs-plot debate). Honestly, after many years working on object detection, the only format that never confused us was "left," "top," "right," and "bottom." While I agree that the user interface should offer several options, internally I can't recommend enough that we use an unambiguous notation.

  • You mentioned a "score" feature, which is a great idea. In my opinion, it should actually probably be in P1. A detection model produces so many detections that they only make sense with a score attached. What could also be very interesting is a per-class threshold in the visual interface. Usually, some classes are more represented than others, so the threshold you want to set can be very different per class (for the same model, it could be 0.3 for rare classes and 0.95 for common classes).

  • I realize you only wanted to give an example, but you don't usually have an accuracy score for each bounding box. The best most libraries give you is the confidence for the winning class, and only during validation (not during training). Indeed, during training, the model (or framework) only returns the loss for the whole image. I would suggest we put this "score" at the same level as "label" and "box" and make it a float.

  • We should take advantage of other tools dealing with classification + detection + polygons + segmentation + multiclass, like Supervisely (a labeling platform) or lightning-flash. Honestly, having a nice and intuitive data format for all these use cases is not trivial. We might benefit from looking at their data schemas and eventually asking them what they would do differently if they could start over.
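Putting the suggestions above together (unambiguous "left"/"top"/"right"/"bottom" keys, a top-level float score, per-class thresholds), filtering could look like the sketch below. The schema and the `filter_boxes` helper are assumptions for illustration, not an implemented API:

```python
def filter_boxes(annotations, thresholds, default=0.5):
    """Keep only boxes whose score meets the threshold for their class
    (hypothetical schema: unambiguous left/top/right/bottom box keys and
    a top-level confidence score per annotation)."""
    return [
        ann for ann in annotations
        if ann["score"] >= thresholds.get(ann["label"], default)
    ]

annotations = [
    {"label": "cat", "score": 0.97,
     "box": {"left": 100, "top": 5, "right": 110, "bottom": 20}},
    {"label": "cat", "score": 0.40,
     "box": {"left": 30, "top": 75, "right": 55, "bottom": 90}},
    {"label": "dog", "score": 0.40,
     "box": {"left": 80, "top": 25, "right": 100, "bottom": 50}},
]
filter_boxes(annotations, {"cat": 0.95, "dog": 0.3})
# keeps the 0.97 cat and the 0.4 dog
```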

Feel free to tell me if I should have raised this discussion differently or elsewhere. I'll act accordingly.

AlexandreKempf avatar Feb 01 '24 10:02 AlexandreKempf

Great feedback @AlexandreKempf! Let's go with your suggestions here.

@mattseddon and @julieg18 have been working on this functionality, and you could work with them on getting this implemented. @AlexandreKempf is our newest ML product engineer who just joined the team.

dberenbaum avatar Feb 01 '24 12:02 dberenbaum

@AlexandreKempf, great suggestions on this feature!

Feel free to take a look at https://github.com/iterative/vscode-dvc/pull/5227 if you'd like to give any feedback on the plots' current design and reach out if you have any questions about VSCode's or Studio's side of things.

julieg18 avatar Feb 02 '24 22:02 julieg18

TODO list for this project:

  • [x] DVClive should save the annotations in a .json file next to the image file
  • [x] DVC should return the annotations when running dvc plots diff --json --split so that VSCode can read them
  • [x] VSCode should display the annotations
  • [ ] DVC should send the annotations to Studio
  • [ ] DVC should send the annotations to Studio for live experiments
  • [ ] Studio should display the annotations

AlexandreKempf avatar Feb 27 '24 07:02 AlexandreKempf