
Improve property-based test for near-duplicate sets

Open elisno opened this issue 2 years ago • 0 comments

Property-based tests for near-duplicate sets are failing intermittently in CI because some Hypothesis health checks don't pass for the generated data.

Stack trace

Every so often, CI randomly fails a test with this error:

FAILED tests/datalab/issue_manager/test_duplicate.py::TestNearDuplicateSets::test_near_duplicate_sets_empty_if_no_issue_next - hypothesis.errors.FailedHealthCheck: Examples routinely exceeded the max allowable size. (20 examples overran while generating 8 valid ones). Generating examples this large will usually lead to bad results. You could try setting max_size parameters on your collections and turning max_leaves down on recursive() calls.
See https://hypothesis.readthedocs.io/en/latest/healthchecks.html for more information about this. If you want to disable just this health check, add HealthCheck.data_too_large to the suppress_health_check settings for this test.

The way the issue manager is constructed in this test rarely passes the health check. The test is failing on unrelated PRs, which slows development down.

A temporary fix was to suppress the health check (the HealthCheck.data_too_large flag). That's not advisable in the long term, so investigating how to improve the data generation would be a great help!
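For reference, suppressing this health check in Hypothesis generally looks like the sketch below. The test name, strategy, and assertion are hypothetical stand-ins, not the actual code in tests/datalab/issue_manager/test_duplicate.py; only `settings`, `suppress_health_check`, and `HealthCheck.data_too_large` are taken from Hypothesis itself.

```python
from hypothesis import HealthCheck, given, settings, strategies as st


# Hypothetical stand-in test: the strategy and assertion are illustrative only.
# The relevant part is the suppress_health_check argument, which silences the
# data_too_large check without changing how the data is generated.
@settings(suppress_health_check=[HealthCheck.data_too_large], max_examples=20)
@given(st.lists(st.integers(), max_size=5))
def test_with_suppressed_health_check(values):
    assert len(values) <= 5
```

Suppression only hides the warning; oversized examples are still generated and discarded, which is why it is a stopgap rather than a fix.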

Task

Improve the way Hypothesis generates the data for the affected test.
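One possible direction, sketched under assumptions: draw the dataset dimensions first, then generate fixed-size rows with bounded values, so each example has a small, predictable size. The strategy name `small_feature_matrices`, the bounds `MAX_POINTS` and `MAX_DIMS`, and the value range are all hypothetical; the real test may need different shapes or a NumPy-based strategy.

```python
from hypothesis import given, settings, strategies as st

# Hypothetical bounds; tune them to what the test actually needs.
MAX_POINTS = 20
MAX_DIMS = 3


@st.composite
def small_feature_matrices(draw):
    """Draw a small rectangular matrix as a list of rows of finite floats."""
    # Draw the shape first so every row can be generated at a fixed length.
    n_points = draw(st.integers(min_value=2, max_value=MAX_POINTS))
    n_dims = draw(st.integers(min_value=1, max_value=MAX_DIMS))
    cell = st.floats(
        min_value=-10.0, max_value=10.0, allow_nan=False, allow_infinity=False
    )
    row = st.lists(cell, min_size=n_dims, max_size=n_dims)
    return draw(st.lists(row, min_size=n_points, max_size=n_points))


# No health checks are suppressed here; the explicit size bounds are meant to
# keep generated examples well below the data_too_large threshold.
@settings(max_examples=50)
@given(small_feature_matrices())
def test_matrices_are_small_and_rectangular(matrix):
    assert 2 <= len(matrix) <= MAX_POINTS
    assert len({len(r) for r in matrix}) == 1
    assert all(-10.0 <= x <= 10.0 for r in matrix for x in r)
```

Explicit `min_size`/`max_size` bounds follow the advice in the error message itself ("try setting max_size parameters on your collections"). Whether bounds this small still exercise the near-duplicate logic well enough is something to check against the real test.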

Update

In https://github.com/cleanlab/cleanlab/pull/902/commits/0f36966ef4246836224afe92a5ab00d91f2d2b5c, the health check in question has been suppressed. So when working on this issue, remember to remove HealthCheck.data_too_large from suppress_health_check and make sure the test can scale to more examples without issues.

elisno, Dec 07 '23 03:12