How to store unusual labels / Y values.

Open sgrimbly opened this issue 4 years ago • 2 comments

I've successfully uploaded a dataset (a subset of PDB), but its labels are unusual in that they are matrices. Storing matrices/ndarrays/sparse arrays as a column in a .csv is not ideal. If you're writing to and reading from these files with pandas, you quickly run into issues where \t and \n characters mess up the parsing. For now I have uploaded a separate pickle file with a dictionary of my labels, but it is probably something the team should consider if you want the full datasets available in a single file.
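To make the workaround concrete, here is a minimal sketch of the "CSV plus separate pickle" layout. The file names, the `id` and `sequence` columns, and the example entries are all hypothetical, and nested lists stand in for the real matrices so the snippet needs only the standard library (ndarrays or sparse arrays can be pickled the same way).

```python
import csv
import pickle
import tempfile
from pathlib import Path

# Hypothetical example data: two entries with matrix-valued labels.
labels = {
    "1abc": [[0, 1], [1, 0]],
    "2xyz": [[0, 0, 1], [0, 0, 0], [1, 0, 0]],
}
rows = [("1abc", "MKV"), ("2xyz", "GSH")]

out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "dataset.csv"            # assumed file name
labels_path = out_dir / "dataset_labels.pkl"  # assumed file name

# Plain string columns stay in the CSV, where they parse cleanly.
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "sequence"])
    writer.writerows(rows)

# The matrices go into a pickle keyed by the same id, so no matrix
# text (with its tabs and newlines) ever ends up inside a CSV cell.
with open(labels_path, "wb") as f:
    pickle.dump(labels, f)

# Round trip: the labels come back with their structure intact.
# Only unpickle files from a source you trust.
with open(labels_path, "rb") as f:
    restored = pickle.load(f)
```

The shared `id` key is what ties each CSV row back to its label matrix.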

Perhaps we could consider if there is some way to automate pulling separate labels files when calling a dataset. This would make no difference to the end user as we could hide some computation from the API. Let me know your thoughts 😃 .
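As a rough sketch of what that automation might look like: the `load_dataset` function, the `_labels.pkl` naming convention, and the `id` column below are assumptions made for illustration, not part of the current bio-datasets API.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def load_dataset(csv_path, labels_suffix="_labels.pkl"):
    """Read a CSV and attach labels from a sidecar pickle, if one exists.

    Hypothetical helper: the sidecar is looked up next to the CSV as
    <csv stem> + labels_suffix and must map each row's "id" to its label.
    """
    csv_path = Path(csv_path)
    with open(csv_path, newline="") as f:
        records = list(csv.DictReader(f))

    sidecar = csv_path.with_name(csv_path.stem + labels_suffix)
    if sidecar.exists():
        # Only unpickle files from a source you trust.
        with open(sidecar, "rb") as f:
            labels = pickle.load(f)
        for record in records:
            record["label"] = labels.get(record["id"])
    return records


# Small demo with made-up data in a temporary directory.
tmp = Path(tempfile.mkdtemp())
with open(tmp / "pdb_subset.csv", "w", newline="") as f:
    f.write("id,sequence\n1abc,MKV\n")
with open(tmp / "pdb_subset_labels.pkl", "wb") as f:
    pickle.dump({"1abc": [[0, 1], [1, 0]]}, f)

records = load_dataset(tmp / "pdb_subset.csv")
```

The caller only ever passes the CSV path; the lookup of the labels file happens behind the call, which is the "hidden computation" idea.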

sgrimbly avatar May 10 '21 13:05 sgrimbly

I am also happy to work on this at some point and open a PR 🚀 .

sgrimbly avatar May 10 '21 13:05 sgrimbly

Hey @sgrimbly, thanks for your contribution! Indeed, we will soon work on integrating other formats for the datasets, because .csv is clearly limited for some data structures. We will start brainstorming on that next week, and pickle should be one of the first other formats integrated. Will keep you updated 😄

theomeb avatar May 11 '21 13:05 theomeb