Improve property-based test for near-duplicate sets
The property-based tests for near-duplicate sets fail intermittently in CI when the generated data does not pass Hypothesis's health checks.
Stack trace
Every so often, CI randomly fails a test with this error:
FAILED tests/datalab/issue_manager/test_duplicate.py::TestNearDuplicateSets::test_near_duplicate_sets_empty_if_no_issue_next - hypothesis.errors.FailedHealthCheck: Examples routinely exceeded the max allowable size. (20 examples overran while generating 8 valid ones). Generating examples this large will usually lead to bad results. You could try setting max_size parameters on your collections and turning max_leaves down on recursive() calls.
See https://hypothesis.readthedocs.io/en/latest/healthchecks.html for more information about this. If you want to disable just this health check, add HealthCheck.data_too_large to the suppress_health_check settings for this test.
The way the issue manager is constructed in this test does not reliably pass the health check. The failure shows up on unrelated PRs and slows development down.
A temporary fix was to suppress the health check (adding HealthCheck.data_too_large to suppress_health_check). That's not advisable in the long term, so investigating how to improve the data generation would be a great help!
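For reference, the temporary suppression looks roughly like the sketch below. The test name and strategy are hypothetical stand-ins, not the real test in tests/datalab/issue_manager/test_duplicate.py. Only the settings(...) call with HealthCheck.data_too_large mirrors what the error message suggests.

```python
from hypothesis import HealthCheck, given, settings, strategies as st


# Hypothetical stand-in for the affected test; the real test builds an
# issue manager from generated data instead of a plain list of floats.
@settings(suppress_health_check=[HealthCheck.data_too_large])
@given(st.lists(st.floats(allow_nan=False), max_size=5))
def test_placeholder_property(values):
    # Trivial property, only here so the example runs end to end.
    assert len(values) <= 5
```

Suppressing the check hides the symptom only: Hypothesis still discards oversized examples, so the test explores less of the input space than max_examples suggests.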
Task
Improve the way Hypothesis generates the data for the affected test so that examples stay within the size limit checked by HealthCheck.data_too_large.
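One possible direction, following the hint in the error message about setting max_size on collections, is to bound the dimensions of the generated data explicitly. The sketch below is an assumption about what such a strategy could look like, not the project's actual fix. MAX_ROWS, MAX_COLS, small_feature_matrices and the test name are all hypothetical.

```python
from hypothesis import given, settings, strategies as st

# Hypothetical bounds; tune them to what the real test needs.
MAX_ROWS = 20
MAX_COLS = 4


@st.composite
def small_feature_matrices(draw):
    """Draw a rectangular list-of-lists with explicitly bounded dimensions."""
    n_rows = draw(st.integers(min_value=2, max_value=MAX_ROWS))
    n_cols = draw(st.integers(min_value=1, max_value=MAX_COLS))
    # Bounded, finite floats keep each element cheap to generate and shrink.
    value = st.floats(min_value=-1.0, max_value=1.0, allow_nan=False)
    row = st.lists(value, min_size=n_cols, max_size=n_cols)
    return draw(st.lists(row, min_size=n_rows, max_size=n_rows))


# No suppress_health_check here: the strategy itself keeps examples small,
# so max_examples can be raised without tripping data_too_large.
@settings(max_examples=200, deadline=None)
@given(small_feature_matrices())
def test_matrix_is_small_and_rectangular(matrix):
    assert 2 <= len(matrix) <= MAX_ROWS
    assert 1 <= len(matrix[0]) <= MAX_COLS
    assert all(len(row) == len(matrix[0]) for row in matrix)
```

Drawing the dimensions first and then fixing min_size and max_size to the same value avoids filtering out ragged examples, which is another common source of wasted generation effort.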
Update
In https://github.com/cleanlab/cleanlab/pull/902/commits/0f36966ef4246836224afe92a5ab00d91f2d2b5c, the health check in question was suppressed. When working on this issue, remember to remove HealthCheck.data_too_large from suppress_health_check and make sure the test scales to more examples without issues.