cleanlab-studio
improved autofix strategy
Skeleton code for improved Auto-Fix strategies
import os

import pandas as pd
from cleanlab_studio import Studio

API_KEY = os.environ['CLEANLAB_API_KEY']
studio = Studio(API_KEY)
df = pd.DataFrame(...)
dataset_id = studio.upload_dataset(df)
project_id = studio.create_project(dataset_id=dataset_id, ...)
cleanset_id = studio.get_latest_cleanset_id(project_id)

# Beginner user:
new_df = studio.autofix_dataset(df, cleanset_id)  # deep copy of df

# Advanced user pattern:
# hyperparam_dict contains integer values corresponding to the number of
# data points to fix/exclude for each issue type
hyperparam_dict = get_autofix_defaults(cleanset_id)
# a user who wants to edit less data manually adjusts the integers in hyperparam_dict
new_df = studio.autofix_dataset(df, cleanset_id, params=hyperparam_dict)
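To make the advanced pattern concrete, here is a minimal sketch of how a user might shrink the counts in hyperparam_dict before passing it to autofix_dataset. It assumes only what the skeleton states, namely that the dict maps each issue type to an integer number of data points. The helper name `scale_autofix_params` and the issue-type keys are hypothetical and not part of cleanlab-studio.

```python
from typing import Dict


def scale_autofix_params(params: Dict[str, int], fraction: float) -> Dict[str, int]:
    """Return a new dict with every count scaled by `fraction`, rounded down.

    Hypothetical helper for illustration; the input dict is not modified.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return {issue: int(count * fraction) for issue, count in params.items()}


# Made-up defaults for illustration. In the skeleton these would come from
# get_autofix_defaults(cleanset_id), and the real keys may differ.
example_defaults = {"label_issue": 120, "near_duplicate": 40, "outlier": 15}

# Edit roughly half as many data points for every issue type.
conservative = scale_autofix_params(example_defaults, 0.5)
print(conservative)
```

The scaled dict would then be passed as `params=` in place of the defaults. Returning a new dict leaves the defaults intact, so they can be reused to try several fractions.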
Link to Notion: https://www.notion.so/cleanlab/Improve-ML-accuracy-with-Studio-via-better-Autofix-99434fa92a164131b3860093d85e5350?pvs=4
Note: this is only for text/tabular datasets, not image.
request my review when this is ready
add a little script on how the user is going to use this thing as a PR comment
from anish: Would you want this to be:
studio.autofix_dataset(cleanset_id)
new_df = studio.apply_corrections(df, cleanset_id)