Redesigning the Cardea Class

Open sarahmish opened this issue 4 years ago • 1 comments

This issue is to track the development of the Cardea class. Previous updates were also mentioned in #73.

The Cardea class is responsible for handling and interacting with all the components (data_assembler, data_labeler, featurizer, and modeler).

Overall, the Cardea class:

- Provides simple user-facing abstractions:
  - label: generate label times
  - featurize: generate feature matrix
  - fit/predict
  - evaluate
  - save/load
- Hides away the interaction with other systems:
  - Entityset
  - Featuretools DeepFeatureSynthesis
  - ComposeML
  - MLBlocks Pipelines
  - Pipeline Selection and Tuning

Design choices:

- Remove load_entityset and assume that a Cardea instance only deals with one data source. The data is loaded upon instantiation.
- Rename generate_label_time to label.
- Rename generate_feature_matrix to featurize.
- Allow the user to inspect label_times and feature_matrix.
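
As a rough illustration of these design choices, here is a minimal sketch using a toy stand-in class. Everything in it is an assumption made for illustration: CardeaSketch, the in-memory list of records (instead of a data path), and the record fields id, time and value are hypothetical and not part of the proposed implementation.

```python
class CardeaSketch:
    """Toy stand-in for the proposed class; not the real implementation."""

    def __init__(self, data):
        # The single data source is loaded once, at instantiation.
        # Here `data` is an in-memory list of records rather than a path.
        self.data = list(data)
        self.label_times = None      # filled in by label()
        self.feature_matrix = None   # filled in by featurize()

    def label(self, labeler):
        # Plays the role of the former generate_label_time.
        # The result is stored so the user can inspect it later.
        self.label_times = [
            {"id": row["id"], "cutoff_time": row["time"], "label": labeler(row)}
            for row in self.data
        ]
        return self.label_times

    def featurize(self, label_times):
        # Plays the role of the former generate_feature_matrix.
        # A single trivial feature is computed per labeled instance.
        by_id = {row["id"]: row for row in self.data}
        self.feature_matrix = [
            {"id": lt["id"], "value": by_id[lt["id"]]["value"]}
            for lt in label_times
        ]
        return self.feature_matrix


records = [
    {"id": 1, "time": "2021-01-01", "value": 3},
    {"id": 2, "time": "2021-01-02", "value": 8},
]
cardea = CardeaSketch(records)
label_times = cardea.label(lambda row: row["value"] > 5)
feature_matrix = cardea.featurize(label_times)
```

After these calls, cardea.label_times and cardea.feature_matrix hold the intermediate results, which is the kind of inspection the last design choice asks for.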

This should be the class public interface:

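from types import FunctionType
from typing import List, Union

import numpy as np
import pandas as pd
from mlblocks import MLPipeline

# DEFAULT_DATA, DEFAULT_LABELER, DEFAULT_PIPELINE and DEFAULT_METRICS are
# assumed to be module-level constants defined elsewhere.


class Cardea: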

    def __init__(self, 
                 data: str = DEFAULT_DATA, 
                 labeler: FunctionType = DEFAULT_LABELER,
                 pipeline: Union[str, dict, MLPipeline] = DEFAULT_PIPELINE, 
                 hyperparameters: dict = None):
        pass

    def label(self, 
              labeler: FunctionType = None,
              parameter: dict = None) -> pd.DataFrame:
        """Create label times using the data labeler.

        Args:
            labeler (function):
                Function that defines the prediction task. It should return a
                tuple of the labeling function, the dataframe, and the name of
                the target entity.
            parameter (dict):
                Variables to change the default parameters, if any.

        Returns:
            pandas.DataFrame:
                A dataframe of cutoff times and their target labels.
        """
        pass

    def featurize(self, 
                  label_times: pd.DataFrame,
                  verbose: bool = False) -> pd.DataFrame:
        """Return the calculated feature matrix.

        Args:
            label_times (pandas.DataFrame):
                A dataframe that indicates cutoff time for each instance.
            verbose (bool):
                Indicate verbosity of the featurization.

        Returns:
            pandas.DataFrame:
                Generated feature matrix.
        """
        pass

    def fit(self, 
            X: Union[np.ndarray, pd.DataFrame], 
            y: Union[np.ndarray, pd.Series, list],
            tune: bool = False, 
            max_evals: int = 10, 
            scoring: str = None,
            verbose: bool = False) -> None:
        """Train the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter optimization.
            verbose (bool):
                Whether to log information during processing.
        """
        pass

    def predict(self, X: Union[np.ndarray, pd.DataFrame]) -> Union[np.ndarray, list]:
        """Get predictions from the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.

        Returns:
            numpy.ndarray or list:
                Predictions for the input data.
        """
        pass

    def fit_predict(self, 
                    X: Union[np.ndarray, pd.DataFrame],
                    y: Union[np.ndarray, pd.Series, list], 
                    tune: bool = False,
                    max_evals: int = 10, 
                    scoring: str = None,
                    verbose: bool = False) -> Union[np.ndarray, list]:
        """Train a cardea pipeline then make predictions.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter optimization.
            verbose (bool):
                Whether to log information during processing.

        Returns:
            numpy.ndarray or list:
                Predictions for the input data.
        """
        pass

    def evaluate(self, 
                 X: Union[np.ndarray, pd.DataFrame], 
                 y: Union[np.ndarray, pd.Series, list],
                 test_size: float = 0.2, 
                 shuffle: bool = True, 
                 fit: bool = False,
                 tune: bool = False, 
                 max_evals: int = 10, 
                 scoring: str = None,
                 metrics: List[str] = DEFAULT_METRICS, 
                 verbose: bool = False) -> pd.Series:
        """Evaluate the cardea pipeline.

        Args:
            X (pandas.DataFrame or numpy.ndarray):
                Inputs to the pipeline.
            y (pandas.Series, numpy.ndarray or list):
                Target values.
            test_size (float):
                The proportion of the dataset to include in the test dataset.
            shuffle (bool):
                Whether or not to shuffle the data before splitting.
            fit (bool):
                Whether to fit the pipeline before evaluating it.
                Defaults to ``False``.
            tune (bool):
                Whether to optimize hyper-parameters of the pipelines.
            max_evals (int):
                Maximum number of hyper-parameter optimization iterations.
            scoring (str):
                The name of the scoring function used in the hyper-parameter optimization.
            metrics (list):
                A list of scoring function names. The scoring functions should be consistent
                with the problem type.
            verbose (bool):
                Whether to log information during processing.

        Returns:
            pandas.Series:
                ``pandas.Series`` containing one element for each
                metric applied, with the metric name as index.
        """
        pass

    def save(self, path: str) -> None:
        """Save this object using pickle.

        Args:
            path (str):
                Path to the file where the serialization of
                this object will be stored.
        """
        pass

    @classmethod
    def load(cls, path: str) -> 'Cardea':
        """Load a Cardea instance from a pickle file.

        Args:
            path (str):
                Path to the file where the instance has been
                previously serialized.

        Returns:
            Cardea:
                A Cardea instance

        Raises:
            ValueError:
                If the serialized object is not a Cardea instance.
        """
        pass
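
The save and load docstrings above describe pickle-based serialization with a type check on load. A minimal sketch of that behavior follows, using a toy stand-in class. CardeaSketch and its pipeline attribute are hypothetical and serve only to make the example runnable; the real class may serialize differently.

```python
import os
import pickle
import tempfile


class CardeaSketch:
    """Toy stand-in used only to illustrate save/load."""

    def __init__(self, pipeline="hypothetical_pipeline"):
        self.pipeline = pipeline

    def save(self, path):
        # Serialize the whole instance with pickle.
        with open(path, "wb") as pickle_file:
            pickle.dump(self, pickle_file)

    @classmethod
    def load(cls, path):
        # Deserialize, then verify the type before handing it back.
        with open(path, "rb") as pickle_file:
            instance = pickle.load(pickle_file)
        if not isinstance(instance, cls):
            raise ValueError(f"Serialized object is not a {cls.__name__} instance")
        return instance


with tempfile.TemporaryDirectory() as tmp_dir:
    # Round trip: save an instance and load it back.
    path = os.path.join(tmp_dir, "cardea.pkl")
    CardeaSketch(pipeline="demo").save(path)
    restored = CardeaSketch.load(path)

    # A pickle file holding some other object is rejected.
    bad_path = os.path.join(tmp_dir, "other.pkl")
    with open(bad_path, "wb") as pickle_file:
        pickle.dump({"not": "cardea"}, pickle_file)
    try:
        CardeaSketch.load(bad_path)
        rejected = False
    except ValueError:
        rejected = True
```

Making load a classmethod lets it be called as CardeaSketch.load(path) without an existing instance, and the isinstance check is what produces the ValueError mentioned in the docstring.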

In addition to the main API, there will be helper functions such as set_pipeline and train_test_split that give users some flexibility in modifying the MLPipeline and its hyperparameters.
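
To make the relationship between a train_test_split helper and evaluate concrete, here is a minimal pure-Python sketch. It is an illustration under assumptions, not the proposed implementation: the function signatures, the seed argument, the METRICS registry, and returning a plain dict instead of a pandas.Series are all hypothetical, and evaluate is written as a free function that takes a predict callable rather than as a method.

```python
import random


def train_test_split(X, y, test_size=0.2, shuffle=True, seed=0):
    """Hypothetical helper: split X and y into train and test parts."""
    indices = list(range(len(X)))
    if shuffle:
        # A seeded generator keeps the split reproducible.
        random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_size))
    train_idx, test_idx = indices[:-n_test], indices[-n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical registry mapping metric names to scoring functions.
METRICS = {"accuracy": accuracy}


def evaluate(predict, X, y, test_size=0.2, shuffle=True, metrics=("accuracy",)):
    """Score `predict` on a held-out split, one entry per metric name."""
    _, X_test, _, y_test = train_test_split(X, y, test_size, shuffle)
    y_pred = [predict(x) for x in X_test]
    return {name: METRICS[name](y_test, y_pred) for name in metrics}


X = list(range(10))
y = [x >= 5 for x in X]
scores = evaluate(lambda x: x >= 5, X, y)
```

Exposing the split as its own helper would let users reuse exactly the same partition that evaluate relies on.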

Apr 01 '21 23:04 sarahmish

Additional requirement:

Add additional data: a new function that lets users specify another data path containing additional tables and/or columns.

Proposing the following change:

Make the load_entityset method create an entityset from scratch.
Create a new function, add_entities, that expects the path to the new data.
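
One way to read this proposal is sketched below using only the standard library. This is an assumption-laden illustration, not the Featuretools-based implementation: EntitySetSketch, read_tables and write_csv are hypothetical names, tables are stored as lists of row dicts read from CSV files, and new columns are matched to existing rows by position, where a real implementation would presumably join on an index.

```python
import csv
import os
import tempfile


def read_tables(path):
    """Read every CSV file in `path` into {table_name: list of row dicts}."""
    tables = {}
    for file_name in sorted(os.listdir(path)):
        if file_name.endswith(".csv"):
            with open(os.path.join(path, file_name), newline="") as csv_file:
                tables[file_name[:-4]] = list(csv.DictReader(csv_file))
    return tables


class EntitySetSketch:
    """Toy stand-in for an entityset; not the Featuretools class."""

    def __init__(self):
        self.tables = {}

    def load_entityset(self, path):
        # Always start from scratch, discarding previously loaded tables.
        self.tables = read_tables(path)

    def add_entities(self, path):
        # New tables are added as-is; existing tables gain the new columns.
        for name, rows in read_tables(path).items():
            if name not in self.tables:
                self.tables[name] = rows
            else:
                for old_row, new_row in zip(self.tables[name], rows):
                    old_row.update(new_row)


def write_csv(path, header, rows):
    """Write a small CSV file for the demonstration below."""
    with open(path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as base, tempfile.TemporaryDirectory() as extra:
    write_csv(os.path.join(base, "patients.csv"), ["id", "age"], [[1, 40], [2, 65]])
    write_csv(os.path.join(extra, "patients.csv"), ["id", "gender"], [[1, "F"], [2, "M"]])
    write_csv(os.path.join(extra, "visits.csv"), ["id", "patient_id"], [[10, 1]])

    es = EntitySetSketch()
    es.load_entityset(base)   # creates the entityset from the first path
    es.add_entities(extra)    # adds a new table and a new column
```

Keeping the two operations separate means calling load_entityset again resets the state, while add_entities only ever extends it.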

Apr 07 '21 05:04 sarahmish