
Cleanlab + Skorch Integration

Open lowlypalace opened this issue 3 years ago • 1 comments

🚀 🚀 Pull Request

Checklist:

  • [ ] My code follows the style guidelines of this project and the Contributing document
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have kept the coverage-rate up
  • [ ] I have performed a self-review of my own code and resolved any problems
  • [ ] I have checked to ensure there aren't any other open Pull Requests for the same change
  • [x] I have described and made corresponding changes to the relevant documentation
  • [ ] New and existing unit tests pass locally with my changes

Changes

This PR integrates the open-source cleanlab library into Hub. Here is a quick snippet of the API:

import hub
from torchvision import transforms

from hub.integrations.cleanlab import clean_labels, create_tensors, clean_view
from hub.integrations import skorch

ds = hub.load("hub://ds")

tform = transforms.Compose(
    [
        transforms.ToPILImage(),
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),
    ]
)

transform = {"images": tform, "labels": None}

# Get scikit-learn compatible PyTorch module to pass into clean_labels
model = skorch(dataset=ds, epochs=5, batch_size=16, transform=transform)

# Obtain a DataFrame with the columns is_label_issue, label_quality, and predicted_label
label_issues = clean_labels(
    dataset=ds,
    model=model,
    folds=3,
)


# Create label_issues tensor
create_tensors(
    dataset=ds,
    label_issues=label_issues,
    branch="main"
)

# Get dataset view where only clean labels are present, and the rest are filtered out.
ds_clean = clean_view(ds)
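
For readers unfamiliar with the three columns named in the snippet, here is a simplified sketch of how such values can be derived from out-of-sample predicted probabilities. It is neither cleanlab's confident-learning algorithm nor this PR's implementation. `summarize_label_issues` is a hypothetical helper. `label_quality` is assumed to be the probability the model assigns to the given label (the "self-confidence" score), and a sample is flagged whenever the model's top prediction disagrees with its given label.

```python
from typing import Dict, List, Sequence


def summarize_label_issues(
    labels: Sequence[int], pred_probs: Sequence[Sequence[float]]
) -> List[Dict[str, object]]:
    """Hypothetical helper: build one row per sample with the three columns."""
    rows = []
    for given, probs in zip(labels, pred_probs):
        # Class with the highest predicted probability.
        predicted = max(range(len(probs)), key=lambda k: probs[k])
        rows.append(
            {
                # Simplification: flag any disagreement with the given label.
                "is_label_issue": predicted != given,
                # Self-confidence: probability assigned to the given label.
                "label_quality": probs[given],
                "predicted_label": predicted,
            }
        )
    return rows


# Toy data: three samples, two classes; the third label looks wrong.
labels = [0, 1, 1]
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
rows = summarize_label_issues(labels, pred_probs)

# A "clean view" in this sketch is just the indices that were not flagged.
clean_indices = [i for i, row in enumerate(rows) if not row["is_label_issue"]]
```

In this toy example only the third sample is flagged, so the clean view keeps the first two. The real library uses per-class confidence thresholds rather than a plain argmax comparison, so its flags can differ.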

To-do

  • [x] Create custom config for pip install (e.g. pip install hub[cleanlab])
  • [x] Add support for validation set
  • [ ] Add prune support to delete samples where is_label_issue = True
  • [x] Try to use a pre-trained model to compute out-of-sample probabilities to skip cross-validation and speed up the training.
  • [x] Add tests for the functions
  • [x] Add types for the class arguments
  • [x] Create a tensor guessed_label to add labels guessed by the classifier after pruning.
  • [x] Add optional cleanlab kwargs to pass down
  • [x] Add optional skorch kwargs to pass down
  • [ ] Add support for TensorFlow modules
  • [x] Add a branch flag to switch to a different branch instead of committing on the current branch.
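
The cross-validation mentioned in the list above (and the `folds` argument in the snippet) serves to produce out-of-sample probabilities: each sample is scored by a model that did not see it during training. Below is a minimal sketch of that bookkeeping, not the PR's code. `out_of_sample_probs` and the `fit_predict` callback are hypothetical names, and the interleaved split is a simplification (a real setup would typically shuffle and stratify).

```python
from typing import Callable, List, Optional, Sequence


def out_of_sample_probs(
    n_samples: int,
    folds: int,
    fit_predict: Callable[[List[int], List[int]], List[List[float]]],
) -> List[Optional[List[float]]]:
    """Hypothetical helper: collect held-out predictions across K folds."""
    pred_probs: List[Optional[List[float]]] = [None] * n_samples
    for fold in range(folds):
        # Interleaved split: sample i is held out in fold i % folds.
        held_out = [i for i in range(n_samples) if i % folds == fold]
        train = [i for i in range(n_samples) if i % folds != fold]
        # The callback trains on `train` and returns probabilities
        # for `held_out`, in the same order.
        for i, probs in zip(held_out, fit_predict(train, held_out)):
            pred_probs[i] = probs
    return pred_probs


# Toy "model": predicts the class frequencies seen in the training split.
labels: Sequence[int] = [0, 0, 1, 0, 1, 0]


def prior_model(train: List[int], held_out: List[int]) -> List[List[float]]:
    p_one = sum(labels[i] for i in train) / len(train)
    return [[1.0 - p_one, p_one] for _ in held_out]


probs = out_of_sample_probs(len(labels), folds=3, fit_predict=prior_model)
```

With a pre-trained model, as the to-do item suggests, the loop would be unnecessary: the model has not been fit on this dataset, so a single prediction pass already yields out-of-sample probabilities.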

lowlypalace, Aug 17 '22 21:08

CLA assistant check
All committers have signed the CLA.

CLAassistant, Aug 17 '22 21:08

Hi @lowlypalace, why is it necessary to integrate skorch?

LangDaoAI, Feb 06 '23 08:02

Hey @lowlypalace, closing this for now. We will reopen it once there is time to work on this. Thanks.

tatevikh, Mar 30 '23 13:03