[FEATURE] Ingest fiftyone datasets
🚨🚨 Feature Request
- [ ] Related to an existing Issue
- [x] A new implementation (Improvement, Extension)
Is your feature request related to a problem?
My problem is that I am unable to ingest fiftyone datasets into deeplake.
- Exporting back to fiftyone would also be an interesting addition.
If your feature will improve HUB
Fiftyone is a common dataset import/export tool; integration with deeplake would make such operations easy, and would mean that we do not have to implement them from scratch.
Description of the possible solution
```python
import deeplake
import fiftyone

# ideally this would be able to detect the various different types and labels
# and import these accordingly.
dataset = fiftyone.load_dataset('my_dataset')
deeplake.ingest_51('deeplake_data/my_dataset', dataset)
```
An alternative solution to the problem can look like
Ingest steps could be written manually. (Fiftyone doesn't enforce much structure on its datasets, so I am not sure the original ingest function even has a single well-defined solution; some basic structure might need to be required.)
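As a rough sketch, a manual ingest for the simple classification case might look something like the following. The tensor names and the `ground_truth` label field are assumptions for illustration, and this handles only classification; detections, segmentations, etc. would each need their own branch.

```python
def build_class_map(labels):
    """Pure helper: map sorted unique label strings to integer indices."""
    return {name: idx for idx, name in enumerate(sorted(set(labels)))}


def ingest_classification(fo_dataset, dest_path, label_field="ground_truth"):
    """Sketch: copy a fiftyone classification dataset into a new deeplake dataset.

    Assumes each sample has a Classification label under `label_field`.
    """
    # Imported inside the function so build_class_map() stays usable
    # without deeplake installed.
    import deeplake

    labels = [sample[label_field].label for sample in fo_dataset]
    class_map = build_class_map(labels)

    ds = deeplake.empty(dest_path)
    ds.create_tensor("images", htype="image", sample_compression="jpeg")
    ds.create_tensor("labels", htype="class_label",
                     class_names=sorted(class_map))
    with ds:
        for sample, label in zip(fo_dataset, labels):
            ds.append({
                "images": deeplake.read(sample.filepath),
                "labels": class_map[label],
            })
    return ds
```

Anything beyond flat classification (multiple label fields, detections with bounding boxes) would need per-htype handling, which is exactly where some required basic structure would help.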
Teachability, Documentation, Adoption, Migration Strategy
Needs discussion first.
hey @nmichlo, thanks a lot for the feature request! I'm tagging @istranic for visibility and follow-up here. If you feel like it, we would welcome a contribution to this enhancement!
@nmichlo thanks a lot for opening the issue. Curious: can you give us more context on the use case? Why would you like to import FiftyOne datasets? (What do you like and dislike about FiftyOne?)
Use Case:
As part of my day-to-day work I often need to find, download, import, and pre-process many different existing datasets, which are usually in common formats like COCO or YOLOv5. Occasionally I will need to write a script to import a custom format, but I generally try to avoid that. These datasets are then often merged together or added to existing datasets that are used for re-training. Improving models means iterating on the data, so version control here would be great, which is why deeplake is so appealing.
- Eventually I would love to transition from exporting to custom formats to using the deeplake dataloaders themselves; however, for experimentation it will often still be necessary to export to various common formats (e.g. YOLOv5, COCO) to avoid code changes to external libs.
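For context, that export step is currently done through fiftyone's own export API; a hedged sketch is below (the short-name mapping, dataset name, and `label_field` are my own assumptions for illustration):

```python
def pick_export_type(fmt):
    """Pure helper: map a short format name to a fiftyone.types class name.

    Kept separate so the mapping is testable without fiftyone installed.
    """
    types = {
        "coco": "COCODetectionDataset",
        "yolov5": "YOLOv5Dataset",
        "voc": "VOCDetectionDataset",
    }
    if fmt not in types:
        raise ValueError(f"unsupported format: {fmt}")
    return types[fmt]


def export_detections(dataset_name, export_dir, fmt="coco",
                      label_field="ground_truth"):
    """Sketch: export a fiftyone detection dataset to a common format."""
    # Imported inside the function so pick_export_type() stays usable
    # without fiftyone installed.
    import fiftyone as fo

    dataset = fo.load_dataset(dataset_name)
    dataset.export(
        export_dir=export_dir,
        dataset_type=getattr(fo.types, pick_export_type(fmt)),
        label_field=label_field,
    )
```

Built-in deeplake export to these same formats would let this step disappear entirely.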
Fiftyone, the good and bad:
Disclaimer: my overall experience with fiftyone is still fairly limited; my main use cases are the import/export functionality, combined with the local preview of datasets, occasional dataset filtering, and renaming/removing labels. Ultimately I would love to replace fiftyone entirely with deeplake and store datasets in our own cloud buckets.
What is good about fiftyone:
- the built in import/export functionality of datasets, from/to many different common dataset formats.
- Often enables converting datasets for use with existing SOTA projects without modification to their source code.
- local dataset previews in the browser without jupyter notebook and external connections
- CVAT/LabelStudio integration for re-labelling of data
- extremely useful for iterative refinement (this would be an amazing feature for deeplake; done correctly, it could really set it apart)
- tagging dataset items directly from the UI for use further down in scripts, e.g. removal of problematic images.
What is bad about fiftyone:
- extremely slow start times, making it painful to use in scripts; this is due to the MongoDB backend, which is heavily integrated and cannot be removed.
- type hints are not great across the project, so IDE support is also poor, making the project difficult to work with.
- structuring of items in datasets is unintuitive, bordering on unstructured.
- no versioning
- not intended for use as a dataloader, export of datasets is required.
- Some of the import/export formats are brittle and don't support the dataset standards entirely.
EDIT: overall, deeplake has been extremely refreshing to work with. Really good work on the project so far!
EDIT2: might be worth adding fiftyone to the README section on "Comparisons to Familiar Tools"?
EDIT3: I can provide examples of my own import fiftyone -> deeplake script, but it is definitely not general in any sense. It was tailored to a specific format, purely as a test.
Based on my clarified use case, I might even argue against my own issue: fiftyone ingest would be a nice-to-have, and ultimately a better solution might be built-in support for ingesting and exporting common dataset formats.
EDIT: this could also serve as a good way of documenting / providing examples of real-world use cases, that can be adapted.
Hey @nmichlo Thank you for the feedback. This is extremely useful for our product development.
As I was reading your comments, I had the same thought as your last note:
- "Based on my clarified use case, I might even argue against my own issue: fiftyone ingest would be a nice-to-have, and ultimately a better solution might be built-in support for ingesting and exporting common dataset formats."
Just want to clarify that I understand it correctly, because it appears aligned with our roadmap. Would you rather have a function to ingest from 51, or a set of functions to ingest from dataset formats such as YOLO, COCO, CVAT, LabelStudio, and others?
@istranic no problem, glad to help!
Ideally in the long run I personally would prefer not to use fiftyone, and ingest/export datasets directly.
However, I think there might be merit in both:
- ingesting directly from fiftyone might keep additional information that would otherwise be discarded by an export step followed by an import step.
- it would also allow easier migration to deeplake.
Got it. Thanks @nmichlo!