[FEATURE] single-CSV auto ingestion
🚨🚨 Feature Request
Related to #1156 (this issue is a building block to it)
Description

Implement the following API:
```python
ds = ...  # create dataset

"""
Columns/structure of the file found at "path/to/file.csv":
--------------------
col1, col2, col3
"text", 1, 5
"another_text", 1, 5.5
"""

ds.ingest("path/to/file.csv")

assert ds.col1.data() == ["text", "another_text"]
assert ds.col2.numpy(as_list=True) == [1, 1]
assert ds.col3.numpy(as_list=True) == [5, 5.5]
assert len(ds) == 2  # num rows
```
#1156 covers ingesting a whole directory of CSV files; supporting a single CSV file is sufficient for now.
Solution
You can probably use pandas to read the CSV file and then create one tensor per column. Set each column's dtype to the dtype pandas infers for it. The htype can only be determined for text columns for now; in the future we will want to infer htypes for other columns as well, but that isn't required yet.
I would create a new subdirectory + file under here called "structured/csv.py" and implement your code there, following the same code structure as the unstructured classes found here.
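The dtype/htype inference step described above could be sketched roughly as below. Note this is only an illustration of the mapping (pandas dtype per column, htype `"text"` for string columns, none otherwise), not Hub's actual tensor-creation API; `infer_column_specs` is a hypothetical helper name.

```python
import pandas as pd


def infer_column_specs(csv_path):
    """Read a CSV with pandas and propose a (dtype, htype) pair per column.

    Per the issue: htype can only be determined for text columns for now;
    numeric columns get htype=None. This is a sketch, not Hub's real API.
    """
    df = pd.read_csv(csv_path)
    specs = {}
    for name in df.columns:
        series = df[name]
        if series.dtype == object:
            # pandas stores strings with dtype=object; treat these as text
            specs[name] = {"dtype": str, "htype": "text"}
        else:
            # e.g. int64 for [1, 1], float64 for [5, 5.5]
            specs[name] = {"dtype": series.dtype, "htype": None}
    return specs
```

A CSV ingestion implementation in `structured/csv.py` could then loop over these specs, creating one tensor per column and appending `df[name]` to it.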
Hey @nollied We're having a quiet week this week... half of us are at CES, and the other half are on vacation. We'll get back to you next week!