smartcore

Implement a generic read_csv method

Open VolodymyrOrlov opened this issue 5 years ago • 3 comments

In many cases, data analysis starts with loading a dataset into memory. Some datasets come as CSV files. We need a new default function read_csv that is defined on the BaseMatrix trait.

This story is not fully defined, and a lot of details should be discussed prior to working on the implementation. For example, I am not sure what parameters (if any) this function should take. Some ideas can be borrowed from the similar function in Pandas.

VolodymyrOrlov avatar Jan 14 '21 02:01 VolodymyrOrlov

I'd like to give it a try!

abhikjain360 avatar Feb 27 '21 16:02 abhikjain360

Hi @abhikjain360, sounds good!

The basic idea is to define a new function in the BaseMatrix trait that loads data from a CSV file. Once the function read_csv is defined in the BaseMatrix trait, it will be automatically available for every type of matrix we support. If the function's definition is too generic, it can always be redefined by a concrete implementation of the matrix later.

One way to implement the function is to read the file first, use one of the matrix initialization functions to create an instance of BaseMatrix, and then push the values into the matrix using the set method. I am also open to any additional abstract methods you might find useful. E.g. you might want to define a new method on BaseMatrix that can initialize a matrix directly from an iterator.
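The read-then-set approach could be sketched roughly as below. This is only an illustration: `MatrixLike` and `DenseRows` are hypothetical stand-ins rather than smartcore types, the `delimiter` parameter and the use of `std::io::Error` are assumptions, and the real BaseMatrix trait may differ. The sketch reads from any `BufRead` and handles floats only.

```rust
use std::io::{self, BufRead};

/// Hypothetical stand-in for the small part of a matrix trait used here.
/// It is NOT smartcore's BaseMatrix; names and signatures are assumptions.
pub trait MatrixLike: Sized {
    fn zeros(nrows: usize, ncols: usize) -> Self;
    fn set(&mut self, row: usize, col: usize, value: f64);
    fn get(&self, row: usize, col: usize) -> f64;
    fn shape(&self) -> (usize, usize);

    /// Default method: every implementor of the trait gets it for free
    /// and can still override it with a more efficient version.
    fn read_csv<R: BufRead>(reader: R, delimiter: char) -> io::Result<Self> {
        // Step 1: read and parse the whole input into rows of floats.
        let mut rows: Vec<Vec<f64>> = Vec::new();
        for line in reader.lines() {
            let line = line?;
            if line.trim().is_empty() {
                continue; // skip blank lines
            }
            let row = line
                .split(delimiter)
                .map(|field| {
                    field
                        .trim()
                        .parse::<f64>()
                        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
                })
                .collect::<io::Result<Vec<f64>>>()?;
            // Reject ragged input: every row must match the first one.
            if let Some(first) = rows.first() {
                if first.len() != row.len() {
                    return Err(io::Error::new(
                        io::ErrorKind::InvalidData,
                        "rows have different lengths",
                    ));
                }
            }
            rows.push(row);
        }

        // Step 2: allocate the matrix, then push values in with `set`.
        let ncols = rows.first().map_or(0, |r| r.len());
        let mut matrix = Self::zeros(rows.len(), ncols);
        for (i, row) in rows.iter().enumerate() {
            for (j, &value) in row.iter().enumerate() {
                matrix.set(i, j, value);
            }
        }
        Ok(matrix)
    }
}

/// Minimal row-major implementor, used only to exercise the default method.
pub struct DenseRows {
    data: Vec<f64>,
    nrows: usize,
    ncols: usize,
}

impl MatrixLike for DenseRows {
    fn zeros(nrows: usize, ncols: usize) -> Self {
        DenseRows { data: vec![0.0; nrows * ncols], nrows, ncols }
    }
    fn set(&mut self, row: usize, col: usize, value: f64) {
        self.data[row * self.ncols + col] = value;
    }
    fn get(&self, row: usize, col: usize) -> f64 {
        self.data[row * self.ncols + col]
    }
    fn shape(&self) -> (usize, usize) {
        (self.nrows, self.ncols)
    }
}
```

To read from disk, the caller could wrap the file in a buffered reader, e.g. `DenseRows::read_csv(std::io::BufReader::new(std::fs::File::open(path)?), ',')`. Taking a reader instead of a path keeps the method easy to test with in-memory data.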

Things to keep in mind: I plan to redesign BaseMatrix and BaseVector in #85. One of the problems you will face is the lack of support for integer and string data types. For now, feel free to limit the read_csv method to floats only.

Let me know if you are stuck or have any questions!

VolodymyrOrlov avatar Feb 27 '21 18:02 VolodymyrOrlov

okay!

Looking at pandas' read_csv, I think it would be better to provide something like a builder struct that implements Default, with functions to change the reading options. Should I add a ReadCsv struct in the same file as BaseMatrix<T>, or should I create a separate file?

Also, in case of errors, should I just reuse Failed? It uses FailedError, which does not cover the cases that come up when reading a file, so should I add more options to it or create a new type specific to parsing files? In the latter case, I think we can just use std::io::Error from the standard library.
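A rough sketch of what I mean, with a builder that implements Default and a dedicated error type wrapping std::io::Error. The options (`delimiter`, `has_header`), the `ReadCsvError` type, and the `read` method returning plain rows are all assumptions for illustration, not existing smartcore API:

```rust
use std::fmt;
use std::io::{self, BufRead};

/// Hypothetical error type for CSV loading (not smartcore's Failed).
#[derive(Debug)]
pub enum ReadCsvError {
    /// The underlying reader failed.
    Io(io::Error),
    /// A field could not be parsed as a float; `line` is 1-based.
    Parse { line: usize, message: String },
}

impl From<io::Error> for ReadCsvError {
    fn from(err: io::Error) -> Self {
        ReadCsvError::Io(err)
    }
}

impl fmt::Display for ReadCsvError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ReadCsvError::Io(e) => write!(f, "I/O error: {}", e),
            ReadCsvError::Parse { line, message } => {
                write!(f, "parse error on line {}: {}", line, message)
            }
        }
    }
}

impl std::error::Error for ReadCsvError {}

/// Hypothetical builder holding the reading options.
#[derive(Debug, Clone)]
pub struct ReadCsv {
    delimiter: char,
    has_header: bool,
}

impl Default for ReadCsv {
    fn default() -> Self {
        ReadCsv { delimiter: ',', has_header: false }
    }
}

impl ReadCsv {
    /// Set the field delimiter (default: ',').
    pub fn delimiter(mut self, delimiter: char) -> Self {
        self.delimiter = delimiter;
        self
    }

    /// Skip the first line if it holds column names (default: false).
    pub fn has_header(mut self, has_header: bool) -> Self {
        self.has_header = has_header;
        self
    }

    /// Parse the input into rows of floats; a matrix can be built from these.
    pub fn read<R: BufRead>(&self, reader: R) -> Result<Vec<Vec<f64>>, ReadCsvError> {
        let skip = if self.has_header { 1 } else { 0 };
        let mut rows = Vec::new();
        for (idx, line) in reader.lines().enumerate().skip(skip) {
            let line = line?; // io::Error converts via From
            if line.trim().is_empty() {
                continue; // skip blank lines
            }
            let row = line
                .split(self.delimiter)
                .map(|field| {
                    field.trim().parse::<f64>().map_err(|e| ReadCsvError::Parse {
                        line: idx + 1,
                        message: e.to_string(),
                    })
                })
                .collect::<Result<Vec<f64>, ReadCsvError>>()?;
            rows.push(row);
        }
        Ok(rows)
    }
}
```

Usage would then look like `ReadCsv::default().delimiter(';').has_header(true).read(reader)`. New options can be added later without breaking existing callers, which is the main reason for preferring a builder over a long parameter list.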

abhikjain360 avatar Feb 28 '21 12:02 abhikjain360

Has this been implemented/resolved?

kastolars avatar Sep 02 '22 00:09 kastolars

I could not find anything in the library so far, and I am currently looking into it :)

titoeb avatar Sep 02 '22 06:09 titoeb