smartcore

Refactor linalg module

Open VolodymyrOrlov opened this issue 4 years ago • 3 comments

I want to use this issue to share a heads-up on a big refactoring that I plan for the linalg module.

Over the last couple of months I have repeatedly run into limitations and shortcomings imposed by the current design of BaseVector and BaseMatrix. To mention a few:

  • It is not possible to define an instance of BaseMatrix that holds string or integer values.
  • BaseMatrix is not designed to hold values of multiple types.
  • Some algorithms, e.g. RandomForest, do not use most of the methods defined in BaseMatrix and BaseVector. Some preprocessing methods that we plan for the future, like LabelEncoder, will not need the linear algebra routines defined for both traits.
  • Some basic operations, like get row or get column, perform an unnecessary copy. This problem stems from the fact that neither trait provides views or iterators that let a developer access the internal structure of the data.
  • All operations are defined as functions. While this is not a big deal, it leads to clumsy-looking code. It would be nicer to use more of the traits defined in std::ops.
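To make the last two points concrete, here is a minimal sketch of what borrowed row views, lazy column iterators and std::ops operators could look like. The DenseMatrix type below and its row and col methods are hypothetical names made up for illustration; this is not the current smartcore API and not a proposal for the final design. Note that the container itself places no numeric bound on T, so it can also hold strings or integers; only the operator impl asks for arithmetic.

```rust
use std::ops::{Add, Index};

/// Hypothetical row-major container; an illustration only, not smartcore's type.
#[derive(Debug, Clone, PartialEq)]
pub struct DenseMatrix<T> {
    nrows: usize,
    ncols: usize,
    values: Vec<T>,
}

impl<T> DenseMatrix<T> {
    pub fn new(nrows: usize, ncols: usize, values: Vec<T>) -> Self {
        assert_eq!(values.len(), nrows * ncols, "shape does not match data length");
        DenseMatrix { nrows, ncols, values }
    }

    /// Borrow a row as a slice: a view into the buffer, no copy.
    pub fn row(&self, r: usize) -> &[T] {
        &self.values[r * self.ncols..(r + 1) * self.ncols]
    }

    /// Walk a column lazily: no allocation, elements are borrowed.
    pub fn col(&self, c: usize) -> impl Iterator<Item = &T> + '_ {
        assert!(c < self.ncols, "column index out of bounds");
        self.values.iter().skip(c).step_by(self.ncols)
    }
}

/// `m[(r, c)]` instead of a `get(r, c)` function.
impl<T> Index<(usize, usize)> for DenseMatrix<T> {
    type Output = T;

    fn index(&self, (r, c): (usize, usize)) -> &T {
        assert!(r < self.nrows && c < self.ncols, "index out of bounds");
        &self.values[r * self.ncols + c]
    }
}

/// `&a + &b` instead of an `add(&a, &b)` function; only numeric-like T gets it.
impl<'a, T> Add for &'a DenseMatrix<T>
where
    T: Add<Output = T> + Copy,
{
    type Output = DenseMatrix<T>;

    fn add(self, rhs: Self) -> DenseMatrix<T> {
        assert_eq!((self.nrows, self.ncols), (rhs.nrows, rhs.ncols), "shape mismatch");
        let values: Vec<T> = self
            .values
            .iter()
            .zip(&rhs.values)
            .map(|(&a, &b)| a + b)
            .collect();
        DenseMatrix::new(self.nrows, self.ncols, values)
    }
}
```

With something like this, `m.row(1)` hands back a slice into the existing buffer, `m.col(0)` can be chained with any iterator adapter, and `&a + &b` reads like ordinary arithmetic. A `DenseMatrix<&str>` compiles as well; it simply does not get the `+` operator.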

As a result, I'd like to see how we can use Rust's type system to design a better data container that addresses all of these shortcomings.

I am open to any suggestions you have. Feel free to post your ideas here.

VolodymyrOrlov avatar Feb 26 '21 18:02 VolodymyrOrlov

As for multi-type support, perhaps we can use something like Apache Arrow's arrays to construct a multi-type dataframe (https://docs.rs/arrow/3.0.0/arrow/record_batch/struct.RecordBatch.html).

We can have a general trait, something like BaseDataFrame, and put data access (get_row, get_col, etc.) in there.

We can translate BaseDataFrame objects into BaseMatrix objects to be used with purely numerical algorithms (e.g. linear models) and adapt tree-based algorithms to work with BaseDataFrame directly.
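A rough sketch of how such a trait might look, using plain Vec-backed columns in place of Arrow so that it stays self-contained. Only the names BaseDataFrame, get_row and get_col come from this thread; Column, Value, VecDataFrame and to_row_major are hypothetical names chosen for illustration. The conversion returns a row-major buffer that a numeric matrix could be built from, and it refuses frames that contain text columns.

```rust
/// Hypothetical typed column; an Arrow array could play this role instead.
#[derive(Debug, Clone, PartialEq)]
pub enum Column {
    Float(Vec<f64>),
    Int(Vec<i64>),
    Text(Vec<String>),
}

impl Column {
    pub fn len(&self) -> usize {
        match self {
            Column::Float(v) => v.len(),
            Column::Int(v) => v.len(),
            Column::Text(v) => v.len(),
        }
    }
}

/// A single cell, borrowed from its column where possible.
#[derive(Debug, Clone, PartialEq)]
pub enum Value<'a> {
    Float(f64),
    Int(i64),
    Text(&'a str),
}

/// General data-access API; numeric routines are deliberately left out.
pub trait BaseDataFrame {
    fn shape(&self) -> (usize, usize);
    fn get_col(&self, c: usize) -> &Column;

    /// Collect one row across all columns, whatever their types.
    fn get_row(&self, r: usize) -> Vec<Value<'_>> {
        let (_, ncols) = self.shape();
        (0..ncols)
            .map(|c| match self.get_col(c) {
                Column::Float(v) => Value::Float(v[r]),
                Column::Int(v) => Value::Int(v[r]),
                Column::Text(v) => Value::Text(v[r].as_str()),
            })
            .collect()
    }

    /// Row-major f64 buffer for numeric algorithms; None if any column is text.
    fn to_row_major(&self) -> Option<Vec<f64>> {
        let (nrows, ncols) = self.shape();
        let mut out = Vec::with_capacity(nrows * ncols);
        for r in 0..nrows {
            for c in 0..ncols {
                match self.get_col(c) {
                    Column::Float(v) => out.push(v[r]),
                    Column::Int(v) => out.push(v[r] as f64),
                    Column::Text(_) => return None,
                }
            }
        }
        Some(out)
    }
}

/// Simplest possible implementation: a list of equally long columns.
pub struct VecDataFrame {
    columns: Vec<Column>,
}

impl VecDataFrame {
    pub fn new(columns: Vec<Column>) -> Self {
        if let Some(first) = columns.first() {
            assert!(
                columns.iter().all(|c| c.len() == first.len()),
                "all columns must have the same length"
            );
        }
        VecDataFrame { columns }
    }
}

impl BaseDataFrame for VecDataFrame {
    fn shape(&self) -> (usize, usize) {
        let nrows = self.columns.first().map_or(0, Column::len);
        (nrows, self.columns.len())
    }

    fn get_col(&self, c: usize) -> &Column {
        &self.columns[c]
    }
}
```

Tree-based algorithms could consume get_col and get_row directly, while linear models would go through to_row_major (or an encoder that first turns text columns into numbers).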

What do you think?

gaxler avatar Feb 28 '21 15:02 gaxler

@gaxler I like the idea of building 1- and 2-dimensional arrays on top of Apache Arrow, but I'd like it to be optional. The linalg module provides an abstraction layer around data; it should not force developers to use any concrete library like ndarray, nalgebra or Apache Arrow to hold the data. I'd like to keep it this way to make sure we can add new array types as easily as possible.
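To illustrate what I mean by an abstraction layer, here is a small sketch in which an algorithm is written once against a trait and then runs on two unrelated storage types. Array2, FlatMatrix and column_means are hypothetical names used only for this example, not existing smartcore items; an Arrow- or ndarray-backed type would just be one more impl, presumably behind an optional Cargo feature.

```rust
/// Hypothetical backend-agnostic 2-D array trait.
pub trait Array2<T: Copy> {
    fn shape(&self) -> (usize, usize);
    fn get(&self, r: usize, c: usize) -> T;
    fn set(&mut self, r: usize, c: usize, value: T);
}

/// Backend 1: nested vectors from the standard library.
impl<T: Copy> Array2<T> for Vec<Vec<T>> {
    fn shape(&self) -> (usize, usize) {
        (self.len(), self.first().map_or(0, |row| row.len()))
    }

    fn get(&self, r: usize, c: usize) -> T {
        self[r][c]
    }

    fn set(&mut self, r: usize, c: usize, value: T) {
        self[r][c] = value;
    }
}

/// Backend 2: a single flat, row-major buffer.
pub struct FlatMatrix<T> {
    nrows: usize,
    ncols: usize,
    data: Vec<T>,
}

impl<T> FlatMatrix<T> {
    pub fn new(nrows: usize, ncols: usize, data: Vec<T>) -> Self {
        assert_eq!(data.len(), nrows * ncols, "shape does not match data length");
        FlatMatrix { nrows, ncols, data }
    }
}

impl<T: Copy> Array2<T> for FlatMatrix<T> {
    fn shape(&self) -> (usize, usize) {
        (self.nrows, self.ncols)
    }

    fn get(&self, r: usize, c: usize) -> T {
        self.data[r * self.ncols + c]
    }

    fn set(&mut self, r: usize, c: usize, value: T) {
        self.data[r * self.ncols + c] = value;
    }
}

/// Written once against the trait; assumes the array has at least one row.
pub fn column_means<M: Array2<f64>>(m: &M) -> Vec<f64> {
    let (nrows, ncols) = m.shape();
    (0..ncols)
        .map(|c| (0..nrows).map(|r| m.get(r, c)).sum::<f64>() / nrows as f64)
        .collect()
}
```

The same column_means call accepts a `Vec<Vec<f64>>` and a `FlatMatrix<f64>` without the algorithm knowing which one it got. The set method is part of the trait on purpose: algorithms that update a few cells in place need a mutable backend.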

Btw, last time I checked, Arrow's arrays were immutable. I'm not sure how we would use an immutable array in algorithms that need to change a small portion of the data while leaving every other cell intact.

VolodymyrOrlov avatar Mar 05 '21 20:03 VolodymyrOrlov

@VolodymyrOrlov I agree that Arrow should be optional; I meant to use Arrow as a reference for multi-datatype structures.

The BaseDataFrame trait will be the general API.

As for mutability, you are right. We won't be able to do something like an in-place decomposition and will need to produce a new matrix. I don't mean for the DataFrame to replace Matrix<RealNumber>; in this case I guess Arrow will be used for preprocessing, and the result will be cast to a real-valued matrix.

gaxler avatar Mar 21 '21 22:03 gaxler