SDV
Add `get_random_subset` poc utility function
Problem Description
We should create a utility function that will allow users to coherently subsample a very large dataset so that it can be used with HMA.
Expected behavior
Add a new function, `get_random_subset`, to the `utils.poc` module.
>>> from sdv.utils import poc
>>> small_dataset = poc.get_random_subset(
...     data,
...     metadata,
...     main_table_name='transactions',
...     num_rows=1000,
...     verbose=True
... )
Success! Your subset has 90% less rows than the original.
Table Name      # Rows (Original)    # Rows (Subset)
sessions        1200                 120
transactions    5000                 200
Parameters
- `data` [dict] - The data dictionary
- `metadata` [MultiTableMetadata] - The metadata for the data
- `main_table_name` [str] - The main table to consider when subsampling
- `num_rows` [int] - The number of rows to subsample from the main table
- `verbose` [bool, optional] - Whether to print a summary of the results of subsampling. Defaults to True.
Returns
- The data dictionary containing the subsampled tables
- If `verbose` is True, it should also print a summary of what the function did:
  - The total percentage of rows that were dropped (i.e. 1 - total number of rows in the subsampled data / total number of rows in the original data)
  - For each table, the original number of rows in the table and the number of rows in the subsampled table
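The summary described above can be sketched as follows. This is a minimal illustration, not the SDV implementation: `summarize_subset` is a hypothetical helper name, the row counts are made up, and `range` objects stand in for tables because only `len()` is needed (pandas DataFrames also support `len()`).

```python
def summarize_subset(original, subset):
    """Return the percentage of rows dropped and per-table row counts.

    Both arguments are dictionaries mapping table names to any object
    that supports len() (hypothetical stand-ins for the real tables).
    """
    total_original = sum(len(table) for table in original.values())
    total_subset = sum(len(table) for table in subset.values())

    # Percentage dropped is the complement of the fraction of rows kept.
    percent_dropped = 100 * (1 - total_subset / total_original)

    # (original rows, subset rows) for every table.
    counts = {name: (len(original[name]), len(subset[name])) for name in original}
    return percent_dropped, counts


# Hypothetical row counts, chosen only for illustration.
original = {'sessions': range(1200), 'transactions': range(5000)}
subset = {'sessions': range(120), 'transactions': range(500)}

percent_dropped, counts = summarize_subset(original, subset)
for name, (before, after) in counts.items():
    print(name, before, after)
```

Computing the percentage from the totals across all tables, rather than averaging per-table ratios, matches the "total number of rows" wording used above.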
Algorithm Overview
- [For disconnected schemas, which we don't currently support but may in the future]
  - Calculate the ratio of `num_rows` to the original table size for the main table
  - For every root table that is disconnected from the main table:
    - Subsample the root table using the ratio found above
- Randomly sample `num_rows` rows from the main table
- If the main table has any parents, for each parent:
  - If all parent rows were referenced in the original main table, drop all parent rows that are no longer referenced by the subsampled main table
  - If there were parent rows that were not referenced (aka childless parent rows) in the original main table, drop any rows that had a reference and are now no longer referenced. Determine the percentage of referenced rows that were dropped, and randomly drop the same percentage of the originally unreferenced parent rows
- Repeat this process for grandparents, great-grandparents, etc. Note that if we have e.g. a diamond-shaped relationship (the main table has 2 parents that share the same parent), we want to keep all rows in the grandparent that are referenced by either parent.
- Use `drop_unknown_references` to enforce referential integrity and drop rows from the descendant tables. Note that this should not change the size of the main table since we only drop unreferenced rows from the parent tables.
- Perform validation:
  - If any subsampled table has no rows, raise an error and suggest re-trying or increasing the `num_rows` parameter
    - This could happen if a parent is aggressively subsampled, causing `drop_unknown_references` to wipe out a child
    - Since we are randomly sampling, re-trying may give a better result
- If verbose, print a summary of how the data was subsampled:
  - Percentage of rows dropped (1 - total number of subsampled rows / original total number of rows)
  - For each table, print the original number of rows vs. the subsampled number of rows
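The parent-subsampling step is the least obvious part of the algorithm, so here is a rough sketch of it under simplifying assumptions. It is not the SDV implementation: `prune_parent_keys` and its parameters are hypothetical names, tables are reduced to plain lists of key values instead of DataFrames, and the number of childless rows to drop is rounded to the nearest integer.

```python
import random


def prune_parent_keys(parent_keys, original_child_fks, sampled_child_fks, seed=None):
    """Return the parent primary keys to keep after a child was subsampled.

    parent_keys: primary key values of the parent table.
    original_child_fks: foreign key values in the original child table.
    sampled_child_fks: foreign key values in the subsampled child table.
    """
    rng = random.Random(seed)
    originally_referenced = set(original_child_fks)
    still_referenced = set(sampled_child_fks)

    referenced = [key for key in parent_keys if key in originally_referenced]
    childless = [key for key in parent_keys if key not in originally_referenced]

    # Referenced parents survive only if the subsampled child still points at them.
    kept_referenced = [key for key in referenced if key in still_referenced]

    # Drop the same share of childless parents as was dropped from the
    # referenced ones (rounding is an assumption of this sketch).
    if referenced:
        dropped_share = 1 - len(kept_referenced) / len(referenced)
    else:
        dropped_share = 0.0
    num_childless_to_keep = len(childless) - round(dropped_share * len(childless))
    kept_childless = set(rng.sample(childless, num_childless_to_keep))

    # Preserve the original ordering of the parent table.
    keep = set(kept_referenced) | kept_childless
    return [key for key in parent_keys if key in keep]


# Hypothetical example: parents 1-4 are referenced, parents 5-10 are childless.
parent_keys = list(range(1, 11))
original_child_fks = [1, 1, 2, 3, 4]
sampled_child_fks = [1, 2]  # the subsampled child only references parents 1 and 2

kept = prune_parent_keys(parent_keys, original_child_fks, sampled_child_fks, seed=0)
```

In this example half of the referenced parents are dropped, so half of the six childless parents are dropped as well, leaving five parent rows. For the diamond-shaped case, one option would be to pass the combined foreign key values from both parents as `sampled_child_fks`, so that a grandparent row is kept if either parent still references it.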