
[Core feature] Support for File Families/Files With Indices

Open CalvinLeather opened this issue 3 years ago • 1 comments

Motivation: Why do you think this is important?

In bioinformatics work especially, file families, or files with indices, are common. A file family is a set of files that share a stem but have different suffixes and jointly contain one dataset. This happens in two main cases:

  • A single CSV-like file has an index that allows efficient lookups (e.g., a .vcf file and a tabix-based .vcf index).
  • A library stores a dataset in multiple parts, often because matrix/tabular data is stored separately from metadata about rows and columns. For example, a plink .bed triple contains genetic information about a number of samples/people: the .bed file stores the matrix of data, the .fam file stores information about the samples (e.g., IDs of the people, aka the columns), and the .bim file stores information about the sites/variants of DNA (aka the rows).

Providing better support for these files will enhance the UX for people writing bioinformatics pipelines with Flyte.

Goal: What should the final outcome look like, ideally?

The following should be true:

  • When using this input pattern in a shell task, the various components of the family are all placed in the same directory. This is required because, for example, vcftools expects the index of a .vcf file to have the same stem (i.e., to sit next to it with the same filename, e.g. ~/dir/test.vcf and ~/dir/test.vcf.tbi).
  • A task that takes a file family as input should take 1 input value, not N. In the user's mental model of the computation to be done, the file family of N files is one entity, not N. Allowing the inputs/outputs to map to this model will improve the UX.
  • Helper utilities should enforce the presence of all files in the family at runtime (e.g., if you create one of these objects and one of the components is missing, an error can be raised).
  • It should be easy to extend these classes with additional methods/transformers that read the file family into memory. Typically this is done with another library (e.g., vcf + .vcf.tbi is loaded with pyvcf, a Python library backed by the C-based htslib, a high-performance library for loading/manipulating vcf files and their indices).
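
To make the first three goals concrete, here is a minimal sketch in plain Python. It is not an existing Flyte or flytekit API: the `FileFamily` class and its `validate` and `localize` methods are hypothetical names made up for illustration, and copying files with `shutil` stands in for whatever download/staging mechanism Flyte would actually use.

```python
from __future__ import annotations

import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FileFamily:
    """Hypothetical helper: one logical dataset stored as several sibling files."""

    stem: Path                 # shared stem, e.g. Path("~/dir/test")
    suffixes: tuple[str, ...]  # e.g. (".vcf", ".vcf.tbi") or (".bed", ".bim", ".fam")

    @property
    def paths(self) -> list[Path]:
        # Append each suffix to the full stem. Path.with_suffix() is avoided
        # because it would replace part of a stem that already contains a dot.
        return [Path(f"{self.stem}{suffix}") for suffix in self.suffixes]

    def validate(self) -> None:
        """Raise if any member of the family is missing on disk."""
        missing = [str(p) for p in self.paths if not p.is_file()]
        if missing:
            raise FileNotFoundError(
                f"file family '{self.stem}' is missing: {', '.join(missing)}"
            )

    def localize(self, target_dir: Path) -> FileFamily:
        """Copy every member into one directory so tools can find them side by side."""
        self.validate()
        target_dir.mkdir(parents=True, exist_ok=True)
        for src in self.paths:
            shutil.copy2(src, target_dir / src.name)
        return FileFamily(target_dir / self.stem.name, self.suffixes)
```

With something like this, a task would accept a single `FileFamily` value, call `localize()` on a working directory, and pass the first path to a tool such as vcftools, which would then find the index next to it. Reader methods (the fourth goal) could be added in subclasses, one per file format.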

Describe alternatives you've considered

  • FlyteDirectory
  • Extend FlyteFile (suggested by Greg Gydush in Slack: https://flyte-org.slack.com/archives/CP2HDHKE1/p1663719669816229?thread_ts=1663692816.505609&cid=CP2HDHKE1)
  • Compose multiple FlyteFiles into a composite

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

CalvinLeather, Sep 22 '22 01:09

This discussion started in Slack.

CalvinLeather, Sep 22 '22 01:09

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

github-actions[bot], Sep 05 '23 00:09

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

github-actions[bot], Sep 12 '23 01:09