datumaro
Images lost when updating a dataset created from a CVAT task
I have two CVAT tasks with overlapping but not identical sets of images. When I use the update method to merge the two datasets, some images appear to be lost after the merged dataset is exported and imported again.
Here is a code snippet to demonstrate the issue:
import os
import shutil

import datumaro as dm

ds1 = dm.Dataset.import_from("dataset1", "cvat")
print(f"Size of ds1: {len(ds1)}")
ds2 = dm.Dataset.import_from("dataset2", "cvat")
print(f"Size of ds2: {len(ds2)}")

# Merge ds2 into ds1
ds3 = ds1.update(ds2)
print(f"Size of ds3 (before export): {len(ds3)}")

if os.path.exists("dataset3"):
    # Remove the previous export for a clean experiment
    shutil.rmtree("dataset3")
ds3.export("dataset3", "cvat", save_media=True)

# Import the exported dataset again:
ds3 = dm.Dataset.import_from("dataset3", "cvat")
print(f"Size of ds3 (after import): {len(ds3)}")
For the code above I get the following output:
Size of ds1: 3
Size of ds2: 2
Size of ds3 (before export): 3
Size of ds3 (after import): 2
The issue seems to be caused by the id attribute of the image tag inside the CVAT annotations.xml file. If the same filename has a different id in the two datasets, one of the items appears to be lost on export. I've managed to work around the issue by manually overriding the frame attribute of each item in ds3 before the export:
for idx, item in enumerate(ds3):
    item.attributes["frame"] = idx
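To check whether a merged dataset is affected before exporting it, something like the sketch below might help. It only looks at the "frame" attribute used in the workaround above and reports values shared by more than one item. The helper names duplicate_frames and renumber_frames are hypothetical, and the SimpleNamespace objects are stand-ins for dataset items so the snippet runs without Datumaro; that colliding frame values are what causes the loss is my assumption, not something I have confirmed in the exporter.

```python
from collections import Counter
from types import SimpleNamespace


def duplicate_frames(items):
    """Return the sorted "frame" values that appear on more than one item."""
    counts = Counter(
        item.attributes["frame"]
        for item in items
        if "frame" in item.attributes
    )
    return sorted(frame for frame, n in counts.items() if n > 1)


def renumber_frames(items):
    """Assign sequential "frame" values, mirroring the workaround above."""
    for idx, item in enumerate(items):
        item.attributes["frame"] = idx


# Hypothetical stand-ins for the items of a merged dataset:
# "b" and "c" carry the same frame value, as could happen after update().
merged = [
    SimpleNamespace(id="a", attributes={"frame": 0}),
    SimpleNamespace(id="b", attributes={"frame": 1}),
    SimpleNamespace(id="c", attributes={"frame": 1}),
]

print(duplicate_frames(merged))  # frame values that collide before the fix
renumber_frames(merged)
print(duplicate_frames(merged))  # empty once every item has a unique frame
```

With real data, the same functions could be called on ds3 right before ds3.export(...), assuming items expose an attributes dict as they do in the workaround.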
P.S.: I've created a separate repository with the full code and data to reproduce the issue