[SPARK-51847][PYTHON] Extend PySpark testing framework util functions with basic data tests

Open stanlocht opened this issue 10 months ago • 5 comments

What changes were proposed in this pull request?

This PR extends the PySpark testing framework with four new utility functions for data quality and integrity testing:

  1. assertColumnUnique: Verifies that specified column(s) contain only unique values
  2. assertColumnNonNull: Checks that specified column(s) do not contain null values
  3. assertColumnValuesInSet: Ensures all values in specified column(s) are within a given set of accepted values
  4. assertReferentialIntegrity: Validates that all non-null values in a source column exist in a target column (similar to foreign key constraints)
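
To make the intended semantics of the four checks concrete, here is a minimal plain-Python sketch over lists of dicts. It is not the PR's implementation and does not use Spark. The helper names (duplicate_values, null_count, values_outside_set, missing_references) and the sample data are hypothetical, and skipping nulls in the accepted-set check is an assumption rather than something the description states.

```python
from collections import Counter

# Hypothetical helpers that mirror the four checks on plain Python rows.
# They only illustrate the semantics; the PR's functions operate on DataFrames.


def duplicate_values(rows, column):
    """Return the values that occur more than once in `column`."""
    counts = Counter(row[column] for row in rows)
    return [value for value, n in counts.items() if n > 1]


def null_count(rows, column):
    """Return how many rows have a null (None) in `column`."""
    return sum(1 for row in rows if row[column] is None)


def values_outside_set(rows, column, accepted):
    """Return non-null values in `column` that are not in `accepted`.

    Assumption: nulls are skipped here and left to the non-null check.
    """
    return [
        row[column]
        for row in rows
        if row[column] is not None and row[column] not in accepted
    ]


def missing_references(source_rows, source_column, target_rows, target_column):
    """Return distinct non-null source values that are absent from the target."""
    target_values = {row[target_column] for row in target_rows}
    missing = []
    for row in source_rows:
        value = row[source_column]
        if value is not None and value not in target_values and value not in missing:
            missing.append(value)
    return missing


# Made-up sample data for illustration.
customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1, "status": "open"},
    {"order_id": 11, "customer_id": None, "status": "closed"},
    {"order_id": 11, "customer_id": 4, "status": "lost"},
]

print(duplicate_values(orders, "order_id"))
print(null_count(orders, "customer_id"))
print(values_outside_set(orders, "status", {"open", "closed"}))
print(missing_references(orders, "customer_id", customers, "id"))
```

Each helper returns the violations instead of raising, so an assertion built on top of it could include the offending values in its error message.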

Why are the changes needed?

The PySpark testing framework does not currently provide dedicated helpers for common data quality checks. These new utility functions address that gap by providing standardized, well-tested implementations of the most common checks. They reduce boilerplate code, improve test readability, and enable testing patterns similar to those in popular data testing frameworks like dbt.

Does this PR introduce any user-facing change?

Yes, this PR introduces new public utility functions in the pyspark.testing module. These are additive changes that don't modify existing functionality.

Example usage:

from pyspark.testing import assertColumnUnique, assertReferentialIntegrity

# Check that 'id' column contains only unique values
assertColumnUnique(df, "id")

# Check that all customer_ids in orders exist in customers.id
assertReferentialIntegrity(orders, "customer_id", customers, "id")

How was this patch tested?

Comprehensive tests were added for all new functions in python/pyspark/sql/tests/test_utils.py. The tests cover:

  • Basic functionality with valid inputs
  • Error cases with invalid inputs
  • Edge cases (e.g., null values, empty DataFrames)
  • Different DataFrame types (Spark, pandas, pandas-on-Spark)
  • Detailed validation of error messages

Each function has multiple test methods that verify both positive and negative test cases. For example, assertReferentialIntegrity has tests for valid relationships, invalid relationships with a single missing value, multiple missing values, and proper handling of null values.
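
The positive and negative cases described above could be organized roughly as follows. This is a sketch under stated assumptions, not the PR's test code: assert_referential_integrity is a hypothetical plain-Python stand-in that works on lists instead of DataFrames, and its error message format is invented for illustration.

```python
import unittest


def assert_referential_integrity(source_values, target_values):
    """Hypothetical stand-in for the PR's check, operating on plain lists.

    Raises AssertionError listing every distinct non-null source value that
    does not appear in the target. Assumes the values are mutually comparable.
    """
    targets = set(target_values)
    missing = sorted(
        {v for v in source_values if v is not None and v not in targets}
    )
    if missing:
        raise AssertionError(f"Missing values in target: {missing}")


class ReferentialIntegrityTest(unittest.TestCase):
    def test_valid_relationship(self):
        # Every source value exists in the target, so nothing is raised.
        assert_referential_integrity([1, 2, 2], [1, 2, 3])

    def test_single_missing_value(self):
        with self.assertRaises(AssertionError) as ctx:
            assert_referential_integrity([1, 4], [1, 2, 3])
        self.assertIn("[4]", str(ctx.exception))

    def test_multiple_missing_values(self):
        with self.assertRaises(AssertionError) as ctx:
            assert_referential_integrity([5, 1, 4], [1, 2, 3])
        self.assertIn("[4, 5]", str(ctx.exception))

    def test_nulls_are_ignored(self):
        # Null source values are not treated as broken references.
        assert_referential_integrity([1, None], [1, 2, 3])

    def test_empty_source(self):
        # An empty source trivially satisfies the constraint.
        assert_referential_integrity([], [1, 2, 3])
```

Saved to a file, the cases can be run with python -m unittest. Checking the message text as well as the exception type corresponds to the "detailed validation of error messages" item in the list above.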

All tests pass on the current master branch.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude 3.7 Sonnet

stanlocht, Apr 18 '25 16:04

cc @asl3 fyi

HyukjinKwon, Apr 20 '25 23:04

Remove [TESTS] from the title, since this seems to be a new user-facing feature.

zhengruifeng, Apr 21 '25 03:04

Hi @HyukjinKwon, @zhengruifeng, @asl3 — just following up to see if you might have a chance to review the PR when time allows. Appreciate your time and input!

stanlocht, May 01 '25 11:05

@asl3 thanks for your review! I've implemented your suggestions. It would be great if you could check it out when you have the time!

stanlocht, May 09 '25 15:05

Hi @HyukjinKwon, @zhengruifeng, @asl3 — just checking in on this PR. I’ve made the changes based on the earlier feedback, so let me know if there’s anything else you’d like to see. Would be great to get this moving if/when you have a moment. Thanks again!

stanlocht, May 22 '25 13:05

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot], Aug 31 '25 00:08

Hey @HyukjinKwon and @zhengruifeng, would it be possible to re-open this PR and remove the Stale tag? It is a feature that would be really useful for my (and hopefully others') use cases. @asl3 has already approved it! Thanks for your time.

stanlocht, Oct 01 '25 12:10