Dataset staging and validation API
Add an API call that performs first-pass validation on a dataset, for use when uploading a new dataset. See the registerDatafile() method in lab\init\loadInitialDatasets.py.
Needed for #119
Proposed dataset staging strategy -
Will create a shared anonymous Docker directory, /appsrc/cache, to contain cached versions of datasets that are being staged for upload.
To add a new dataset, there will now be two API pathways:
- Register a dataset with a single API call that includes the data and the complete dataset specification (target field, categorical and ordinal column specifications). If the dataset is invalid, an error will be returned and nothing will be saved. If the data is valid, it will be registered.
- Use a multi-call API process to stage, interrogate, and then register a staged dataset.
  - Stage a dataset by passing just the data to the server with an API call; the server will return an identifier to use with the staged data.
  - The webserver uses API calls to get details about the staged dataset in order to build a UI to prompt the user for dataset details and display a data preview.
  - A 'submit' API call can be used to attempt to register the dataset and clear it out of the cache. If the data is valid, it will be registered and the cache cleared. If it is invalid, the cache will remain and the errors will be returned.
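The stage/validate/register semantics above can be sketched in Python. This is an illustrative model only: the class and method names (`StagingCache`, `stage`, `register`) are assumptions, the cache is held in memory rather than in /appsrc/cache, and registration writes to a dict instead of MongoDB.

```python
import csv
import io
import uuid

class StagingCache:
    """Minimal in-memory sketch of the proposed staging cache semantics."""

    def __init__(self):
        self._staged = {}      # reference id -> {'filename': ..., 'rows': ...}
        self._registered = {}  # stand-in for the real datastore (e.g. MongoDB)

    def stage(self, filedata, filename, delimiter=','):
        """Parse raw file data and cache it under a new reference id."""
        rows = list(csv.reader(io.StringIO(filedata), delimiter=delimiter))
        ref_id = str(uuid.uuid4())
        self._staged[ref_id] = {'filename': filename, 'rows': rows}
        return ref_id

    def register(self, ref_id, target_column):
        """Validate the staged data. If valid, register it and clear the
        cache entry; if invalid, keep the cache entry and return errors."""
        entry = self._staged[ref_id]
        header, data = entry['rows'][0], entry['rows'][1:]
        errors = []
        if target_column not in header:
            errors.append('unknown target column: %s' % target_column)
        if not data:
            errors.append('dataset has no data rows')
        if errors:
            return errors  # cache kept so the client can correct and resubmit
        self._registered[entry['filename']] = entry['rows']
        del self._staged[ref_id]
        return []
```

A failed `register` leaves the staged entry in place, so the client can fix the specification and resubmit against the same reference id, matching the 'submit' behavior described above.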
Open questions -
- Should registered dataset data be stored in a directory that is shared between the lab instance and the machine instances, or should it be stored in MongoDB?
- If the dataset contains categorical data, should the encoding be done just once at the time of dataset registration, or should it be done by the machine instance as a preprocessing step prior to running an experiment?
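For the "encode once at registration" option, the encoding step itself is simple. A hedged sketch, where `encode_categorical` is a hypothetical helper (not part of the existing codebase) that replaces string values in one column with integer codes:

```python
def encode_categorical(rows, column_index, ordered_values=None):
    """Replace string values in one column with integer codes.

    If ordered_values is given (an ordinal column), codes follow that
    order; otherwise codes are assigned by order of first appearance.
    Returns the encoded rows and the value-to-code mapping, which would
    need to be stored with the dataset so the codes can be decoded later.
    """
    if ordered_values is not None:
        mapping = {v: i for i, v in enumerate(ordered_values)}
    else:
        mapping = {}
        for row in rows:
            mapping.setdefault(row[column_index], len(mapping))
    encoded = [list(row) for row in rows]
    for row in encoded:
        row[column_index] = mapping[row[column_index]]
    return encoded, mapping
```

Encoding at registration means the mapping is computed once and shared by every experiment; deferring it to the machine instance keeps the stored data human-readable but repeats the work per run.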
Simple dataset registration API: /dataset/{}
Proposed staging dataset API: /dataset/stage {filedata, filename, delimiter} - add the file to the staging cache, return a reference id
/dataset/stage/{id}/getColumns() - return a list of all columns
/dataset/stage/{id}/getColumnValues(column, all) - return up to 50(?) unique values in a column. If 'all' == true, return all unique values even if > 50
/dataset/stage/{id}/getDataRows(start, count) - return 'count' data rows starting from row 'start'
/dataset/stage/{id}/suspectedCategorical(column) - return boolean; if true, the column has string values and is suspected to be categorical
/dataset/stage/{id}/validate(targetColumn, categoricalColumns, ordinalColumns) - performs validation, returns errors/warnings
/dataset/stage/{id}/register(targetColumn, categoricalColumns, ordinalColumns[ordered values]) - validate the data. If valid, register the data with MongoDB and clear the cache. If not, keep the cache and return errors.
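One possible heuristic behind suspectedCategorical(column): flag a column when its values are not parseable as numbers and the distinct-value count is small. A sketch, where the function name and the 50-value threshold are assumptions carried over from the getColumnValues limit above, not a settled design:

```python
def suspected_categorical(values, max_unique=50):
    """Return True if a column's values look categorical.

    A column is suspected categorical when at least one value is a
    non-numeric string and the number of distinct values is at most
    max_unique. Purely numeric columns are never flagged.
    """
    def is_number(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    if all(is_number(v) for v in values):
        return False  # numeric column, leave it to the user to mark as ordinal
    return len(set(values)) <= max_unique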