Functional API

Open sarahmish opened this issue 4 years ago • 0 comments

Now that we have the Cardea class, it would be beneficial to add a layer of functional interfaces that allow using Cardea in as few steps as possible. The design of the functional API would be problem-centric: there would be one function for each given problem.

The functional API hides the details of composing a Cardea pipeline. It is designed to return a pipeline that has been fitted on a given raw dataset. The user can then use the returned Cardea instance to:

make predictions on new source data (not necessarily future data).
make predictions on future data.
save/load the Cardea instance.
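
To make the three uses above concrete, here is a minimal sketch built on a stand-in class, since the functional API does not exist yet. `FittedModel`, its `predict` rule, and the `save`/`load` helpers are hypothetical names for illustration only and are not Cardea's actual interface; the only real API used is the standard-library `pickle` module.

```python
import os
import pickle
import tempfile


class FittedModel:
    """Hypothetical stand-in for a fitted cardea instance."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, values):
        # Toy rule: flag every value above the stored threshold.
        return [int(v > self.threshold) for v in values]

    def save(self, path):
        # Persist only the state dict to keep the example portable.
        with open(path, "wb") as f:
            pickle.dump({"threshold": self.threshold}, f)

    @classmethod
    def load(cls, path):
        with open(path, "rb") as f:
            return cls(**pickle.load(f))


model = FittedModel(threshold=0.5)
new_source_preds = model.predict([0.2, 0.9])  # new source data
future_preds = model.predict([0.7])           # data collected later

# Save the instance, then load it back and reuse it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cardea.pkl")
    model.save(path)
    restored = FittedModel.load(path)

restored_preds = restored.predict([0.2, 0.9])
```

The restored object gives the same predictions as the original one, which is the behaviour the save/load step is meant to guarantee.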

Design

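# Assumed imports; the final module layout may differ.
from typing import List, Optional, Union

from cardea import Cardea
from mlblocks import MLPipeline

# DEFAULT_PIPELINE and DEFAULT_METRICS are module-level constants that
# still need to be defined as part of this design.


def model_pred_prob(data_path: str,
                    fhir: bool = True,
                    pipeline: Union[str, dict, MLPipeline] = DEFAULT_PIPELINE,
                    hyperparameters: Optional[Union[str, dict]] = None,
                    max_depth: int = 1,
                    max_features: int = -1,
                    n_jobs: int = 1,
                    test_size: float = 0.2,
                    shuffle: bool = True,
                    tune: bool = False,
                    max_evals: int = 10,
                    scoring: Optional[str] = None,
                    evaluate: bool = False,
                    metrics: List[str] = DEFAULT_METRICS,
                    return_lt: bool = False,
                    return_fm: bool = False,
                    return_pred: bool = False,
                    verbose: bool = False,
                    save_path: Optional[str] = None) -> Cardea: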
    """Create and train a cardea instance on a specific prediction problem.

    Return a cardea class object that has been trained on the given
    dataset. The function loads the data, extracts label times, generates
    features, then trains the pipeline all in one command.

    Args:
        data_path (str):
            Path to the directory containing the .csv files to load.
        fhir (bool):
            Whether the data follows the FHIR schema (``True``) or the
            MIMIC schema (``False``).
        pipeline (str or MLPipeline or dict):
            Pipeline to use. It can be passed as:
                * A ``str`` with a path to a JSON file.
                * A ``str`` with the name of a registered pipeline.
                * A ``str`` with the path to a pickle file.
                * An ``MLPipeline`` instance.
                * A ``dict`` with an ``MLPipeline`` specification.
        hyperparameters (str or dict):
            Hyperparameters to set on the pipeline. It can be passed as
            a hyperparameters ``dict`` in the ``mlblocks`` format or as
            a path to the corresponding JSON file. Defaults to ``None``.
        max_depth (int):
            Maximum allowed depth of features.
        max_features (int):
            Cap on the number of generated features. If -1, no limit.
        n_jobs (int):
            Number of parallel processes to use when calculating the
            feature matrix.
        test_size (float):
            The proportion of the dataset to include in the test dataset.
        shuffle (bool):
            Whether or not to shuffle the data before splitting.
        tune (bool):
            Whether to optimize hyper-parameters of the pipelines.
        max_evals (int):
            Maximum number of hyper-parameter optimization iterations.
        scoring (str):
            The name of the scoring function used in the hyper-parameter
            optimization.
        evaluate (bool):
            Whether to evaluate the performance of the pipeline. If ``True``,
            performance is evaluated on the test data; if no test data is
            given, it is evaluated on the training data.
        metrics (list):
            A list of scoring function names. The scoring functions should
            be consistent with the problem type.
        return_lt (bool):
            Whether to return ``label_times``.
        return_fm (bool):
            Whether to return the calculated feature matrix.
        return_pred (bool):
            Whether to return the predictions of the pipeline.
        verbose (bool):
            Whether to show information during processing.
        save_path (str):
            Path to the file where the fitted pipeline will be stored
            using ``pickle``.

    Returns:
        Cardea, dict:
            * A fitted Cardea instance.
            * Intermediary outputs when indicated.
    """

    pass
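
As a rough sketch of two pieces the function body would need, the helpers below show one way to accept `hyperparameters` as either a dict or a JSON path, and one way to assemble the "Cardea, dict" return value from the `return_lt`, `return_fm` and `return_pred` flags. The helper names `resolve_hyperparameters` and `collect_outputs`, the dict keys, and the `"block_name#1"` entry are all assumptions made for illustration; none of them come from Cardea or mlblocks.

```python
import json
import os
import tempfile


def resolve_hyperparameters(hyperparameters):
    """Accept a dict, a path to a JSON file, or None (hypothetical helper)."""
    if hyperparameters is None or isinstance(hyperparameters, dict):
        return hyperparameters
    if isinstance(hyperparameters, str):
        with open(hyperparameters) as f:
            return json.load(f)
    raise TypeError("hyperparameters must be a dict, a JSON path, or None")


def collect_outputs(instance, label_times=None, feature_matrix=None,
                    predictions=None, return_lt=False, return_fm=False,
                    return_pred=False):
    """Return the instance alone, or with a dict of the requested extras."""
    extras = {}
    if return_lt:
        extras["label_times"] = label_times
    if return_fm:
        extras["feature_matrix"] = feature_matrix
    if return_pred:
        extras["predictions"] = predictions
    return (instance, extras) if extras else instance


# Usage: hyperparameters read from a temporary JSON file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hyperparameters.json")
    with open(path, "w") as f:
        json.dump({"block_name#1": {"n_estimators": 10}}, f)
    from_file = resolve_hyperparameters(path)

# Usage: a plain string stands in for the fitted instance.
only_instance = collect_outputs("fitted")
with_extras = collect_outputs("fitted", predictions=[0, 1], return_pred=True)
```

Returning the bare instance when no flag is set keeps the common case simple, at the cost of a return type that depends on the arguments. Always returning a tuple would be the more predictable alternative.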

Apr 20 '21 22:04 sarahmish