Add `get_random_subset` poc utility function

Open frances-h opened this issue 1 year ago • 0 comments

Problem Description

We should create a utility function that will allow users to coherently subsample a very large dataset so that it can be used with HMA.

Expected behavior

Add a new function, `get_random_subset`, to the `utils.poc` module.

>>> from sdv.utils import poc
>>> small_dataset = poc.get_random_subset(
...     data,
...     metadata,
...     main_table_name='transactions',
...     num_rows=1000,
...     verbose=True
... )
Success! Your subset has 90% less rows than the original.

Table Name    # Rows (Original)    # Rows (Subset)
sessions      1200                 120            
transactions  5000                 200

Parameters

data [dict] - The data dictionary
metadata [MultiTableMetadata] - The metadata for the data
main_table_name [str] - The main table to consider when subsampling
num_rows [int] - The number of rows to subsample from the main table
verbose [bool, optional] - Whether to print a summary of the results of subsampling. Defaults to True.

Returns

The data dictionary containing the subsampled tables
If verbose is True, it should also print a summary of what the function did:
- The total percentage of rows that were dropped (i.e. 1 - total number of rows in the subsampled data / total number of rows in the original data, expressed as a percentage)
- For each table, the original number of rows in the table and the number of rows in the subsampled table
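A minimal sketch of how the summary numbers could be computed, assuming each table supports `len()` (as a pandas DataFrame does). The helper name `summarize_subset` is hypothetical and not part of the proposed API; the row counts in the usage lines are made up for illustration.

```python
def summarize_subset(original, subset):
    """Compute the numbers for the verbose summary.

    Hypothetical helper: `original` and `subset` are dicts mapping
    table name -> table, where each table supports len().
    """
    total_original = sum(len(table) for table in original.values())
    total_subset = sum(len(table) for table in subset.values())

    # Percentage dropped = share of original rows that are NOT in the subset.
    percent_dropped = 100 * (total_original - total_subset) / total_original

    # One (table name, original rows, subset rows) entry per table.
    per_table = [
        (name, len(original[name]), len(subset[name]))
        for name in original
    ]
    return percent_dropped, per_table


# Illustrative row counts only; range objects stand in for real tables.
original = {'sessions': range(1000), 'transactions': range(4000)}
subset = {'sessions': range(100), 'transactions': range(400)}
percent_dropped, per_table = summarize_subset(original, subset)
```

Returning the numbers rather than printing them keeps the calculation easy to test; the printing itself can then be gated on `verbose`.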

Algorithm Overview

[For disconnected schemas, which we don't currently support but may in the future]
- Calculate ratio of num_rows to the original table size for the main table
- For every root table that is disconnected from the main table:
  - subsample the root table using the ratio found above
Randomly sample num_rows rows from the main table
If the main table has any parents, for each parent:
- If all parent rows were referenced in the original main table, drop all parent rows that are no longer referenced by the subsampled main table
- If there were parent rows that were not referenced (aka childless parent rows) in the original main table, drop any rows that had a reference and are now no longer referenced. Determine the percentage of referenced rows that were dropped, and randomly drop the same percentage of the originally unreferenced parent rows
- Repeat this process for grandparents, great-grandparents, etc. Note that if we have e.g. a diamond-shaped relationship (the main table has 2 parents that share the same parent), we want to keep all rows in the grandparent that are referenced by either parent.
Use `drop_unknown_references` to enforce referential integrity and drop rows from the descendant tables. Note that this should not change the size of the main table, since we only drop unreferenced rows from the parent tables.
Perform validation:
- If any subsampled table has no rows, raise an error and suggest re-trying or increasing the num_rows parameter
  - This could happen if a parent is aggressively subsampled, causing drop_unknown_references to wipe out a child
  - Since we are randomly sampling, re-trying may give a better result
If verbose is True, print a summary of how the data was subsampled:
- Percentage of rows dropped (1 - total number of subsampled rows / original total number of rows)
- For each table, print original number of rows vs subsampled number of rows
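The parent-subsampling step above can be sketched as follows. This is only an illustration under stated assumptions: tables are plain lists of dicts standing in for DataFrames, and `subsample_parent` and all of its parameter names are hypothetical. Passing the references as sets lets the caller take the union across several children, which covers the diamond-shaped case.

```python
import random


def subsample_parent(parent_rows, primary_key, original_refs, kept_refs, rng):
    """Hypothetical sketch of subsampling one parent table.

    original_refs: parent keys referenced by the ORIGINAL child table(s)
    kept_refs:     parent keys referenced by the SUBSAMPLED child table(s)
    """
    referenced = [r for r in parent_rows if r[primary_key] in original_refs]
    childless = [r for r in parent_rows if r[primary_key] not in original_refs]

    # Keep only the referenced parents that the subsampled children still use.
    kept = [r for r in referenced if r[primary_key] in kept_refs]

    # Share of referenced parents that survived. Keeping all childless rows
    # when nothing was referenced is an assumption, not something the issue
    # specifies.
    keep_ratio = len(kept) / len(referenced) if referenced else 1.0

    # Drop the same share of the originally childless parents, at random.
    kept_childless = rng.sample(childless, round(len(childless) * keep_ratio))

    keep_ids = {r[primary_key] for r in kept + kept_childless}
    # Filter the original list so the surviving rows keep their order.
    return [r for r in parent_rows if r[primary_key] in keep_ids]


# Toy data: parents 0-5 have children, parents 6-9 are childless.
parents = [{'id': i} for i in range(10)]
original_refs = {0, 1, 2, 3, 4, 5}
kept_refs = {0, 1, 2}  # the subsampled main table only references these
result = subsample_parent(parents, 'id', original_refs, kept_refs, random.Random(0))
result_ids = {r['id'] for r in result}
```

In this toy run, half of the referenced parents lose all their children, so half of the childless parents are dropped as well. The same function could be applied again to grandparents, using the subsampled parents as the "children".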

Mar 27 '24 20:03 frances-h