questdb icon indicating copy to clipboard operation
questdb copied to clipboard

feat(sql): add regr_r2, regr_rmse, and regr_tstats aggregate functions

Open ritoban23 opened this issue 3 months ago • 3 comments

Add regr_r2, regr_rmse, and regr_tstats aggregate functions

This PR implements three new SQL aggregate functions for regression analysis in QuestDB:

  • regr_r2(y, x): Coefficient of determination (R²)
  • regr_rmse(y, x): Root Mean Square Error (RMSE)
  • regr_tstats(y, x): t-statistic for regression slope significance

Fixes #6320

Features

  • Numerically stable single-pass computation (Welford’s algorithm)
  • Handles all edge cases: nulls, zero variance, perfect fits, overflow
  • Parallel aggregation support via merge()
  • Compatible with PostgreSQL’s regr_* semantics

Testing

  • Total tests: 40 (13 R², 13 RMSE, 14 t-stats)
  • Status: All tests passing
  • Covers: perfect fits, null handling, type conversions, overflow, edge cases

Files Added

  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/RegressionR2FunctionFactory.java
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/RegressionRmseFunctionFactory.java
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/RegressionTStatsFunctionFactory.java
  • core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/RegressionR2FunctionFactoryTest.java
  • core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/RegressionRmseFunctionFactoryTest.java
  • core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/RegressionTStatsFunctionFactoryTest.java

ritoban23 avatar Oct 28 '25 20:10 ritoban23

Walkthrough

This pull request introduces three new group-by regression aggregate functions: regr_r2 (R-squared), regr_rmse (root mean square error), and regr_tstats (t-statistics). Each function includes a factory implementation that supports parallelism, null handling, and online aggregation with merge semantics. Comprehensive test suites validate behavior across edge cases including null values, identical values, perfect fits, and overflow scenarios.

Changes

Cohort / File(s) Summary
Factory implementations
core/src/main/java/io/questdb/griffin/engine/functions/groupby/RegressionR2FunctionFactory.java, RegressionRmseFunctionFactory.java, RegressionTStatsFunctionFactory.java
Three new group-by function factories implementing regr_r2(DD), regr_rmse(DD), and regr_tstats(DD). Each maintains online aggregates (meanY, meanX, sumXY, sumX, sumY, count), supports parallel merging with numerically stable formulas, handles edge cases (division by zero, negative residuals), and exposes binary function interfaces.
Test suites
core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/RegressionR2FunctionFactoryTest.java, RegressionRmseFunctionFactoryTest.java, RegressionTStatsFunctionFactoryTest.java
Three test classes each with ~12–13 test methods covering null inputs, no values, identical values, perfect linear fits, mixed null/valid data, float/int types, single values, overflow behavior, and linear relationships with intercepts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Statistical formula correctness for R², RMSE, and t-statistics calculations, including slope and standard error derivations
  • Edge case handling: division by zero safeguards, negative residual clamping, NaN propagation, and count thresholds
  • Parallel merge logic: correct weighting and combination of partial aggregates across partitions
  • Test coverage: validate all edge cases and ensure expected behavior for null, empty, and overflow scenarios

Suggested labels

Enhancement, SQL

Suggested reviewers

  • puzpuzpuz
  • nwoolmer

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Linked Issues Check ⚠️ Warning The linked issue #6320 requests three group-by aggregate functions: regr_r2, regr_rmse, and regr_tstats. While the PR provides factory implementations for all three functions and marks them as group-by capable, critical integration gaps have been identified by reviewers. According to the PR objectives, function registration in core/src/main/resources/function_list.txt is missing, which prevents these functions from being available as SQL aggregate functions in the system. Additionally, a correctness issue exists where the TStats function returns ±Infinity for perfect fits while tests expect null, indicating either a code or test inconsistency that needs resolution. These gaps mean the primary objective of making these functions available as working SQL aggregates has not been fully achieved.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (3 passed)
Check name Status Explanation
Out of Scope Changes Check ✅ Passed All code changes in the PR are directly related to implementing the three regression aggregate functions requested in issue #6320. The changeset includes three factory implementations (RegressionR2FunctionFactory, RegressionRmseFunctionFactory, RegressionTStatsFunctionFactory) and corresponding comprehensive test classes. Every file added focuses specifically on implementing, testing, or supporting these three functions. There are no changes to unrelated functionality, utilities, or other system components that would fall outside the scope of the stated objective.
Description Check ✅ Passed The pull request description is directly related to the changeset and provides meaningful context about the implementation. It explains what each of the three functions does (R², RMSE, t-statistics), highlights key features (numerically stable computation, edge case handling, parallel aggregation support), references the associated issue (#6320), lists the test coverage, and enumerates the files added. This level of detail appropriately documents the changeset without being excessive.
Title check ✅ Passed The PR title accurately and specifically summarizes the main change: adding three new SQL aggregate functions (regr_r2, regr_rmse, regr_tstats) for regression statistics, which is the primary focus of this changeset.
✨ Finishing touches
  • [ ] 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Oct 28 '25 20:10 coderabbitai[bot]

@coderabbit review. Check for implementation completeness, check for test coverage, check for thread-safety and that multi-threaded tests are present. Also check if there are performance issues, java allocations, or redundant execution on critical path.

bluestreak01 avatar Oct 31 '25 23:10 bluestreak01

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read the three implementation files
echo "=== RegressionR2FunctionFactory.java ==="
cat core/src/main/java/io/questdb/griffin/engine/functions/groupby/RegressionR2FunctionFactory.java

echo -e "\n\n=== RegressionRmseFunctionFactory.java ==="
cat core/src/main/java/io/questdb/griffin/engine/functions/groupby/RegressionRmseFunctionFactory.java

echo -e "\n\n=== RegressionTStatsFunctionFactory.java ==="
cat core/src/main/java/io/questdb/griffin/engine/functions/groupby/RegressionTStatsFunctionFactory.java

Length of output: 29231


🏁 Script executed:

#!/bin/bash
# Read the three test files
echo "=== RegressionR2FunctionFactoryTest.java ==="
cat core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/RegressionR2FunctionFactoryTest.java

echo -e "\n\n=== RegressionRmseFunctionFactoryTest.java ==="
cat core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/RegressionRmseFunctionFactoryTest.java

echo -e "\n\n=== RegressionTStatsFunctionFactoryTest.java ==="
cat core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/RegressionTStatsFunctionFactoryTest.java

Length of output: 17496


🏁 Script executed:

#!/bin/bash
# Check if there are similar aggregate function implementations to compare patterns
echo "=== Looking for similar regression function implementations ==="
fd -e java "Regr.*FunctionFactory.java" core/src/main/java/io/questdb/griffin/engine/functions/groupby/

echo -e "\n=== Looking for example aggregate function with merge support ==="
# Find a few other groupby functions to compare implementation patterns
fd -e java "FunctionFactory.java" core/src/main/java/io/questdb/griffin/engine/functions/groupby/ | head -5

Length of output: 1215


🏁 Script executed:

#!/bin/bash
# Check for multi-threaded test infrastructure in the codebase
echo "=== Searching for parallel/multi-threaded test patterns ==="
rg -n "parallel" core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ -A2 -B2 | head -40

echo -e "\n=== Searching for multi-thread or concurrent test patterns ==="
rg -n "thread|concurrent" core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ -A2 -B2 | head -40

Length of output: 123


🏁 Script executed:

#!/bin/bash
# Look at one example of existing aggregate function to compare implementation pattern
echo "=== Example: CorrFunctionFactory (similar correlation function) ==="
fd -e java "CorrFunctionFactory.java" core/src/main/java/io/questdb/griffin/engine/functions/groupby/ --exec cat {}

Length of output: 67


🏁 Script executed:

#!/bin/bash
# Check if the new functions are registered in the function factory
echo "=== Searching for function registration ==="
rg -n "regr_r2|regr_rmse|regr_tstats" core/src/main/java/io/questdb/griffin/

echo -e "\n=== Looking at FunctionFactory registration file ==="
fd -e java "FunctionFactoryCache" core/src/main/java/io/questdb/griffin/ --exec cat {}

Length of output: 11413


🏁 Script executed:

#!/bin/bash
# Look at an existing regression function for comparison
echo "=== RegressionSlopeFunctionFactory (existing similar function) ==="
cat core/src/main/java/io/questdb/griffin/engine/functions/groupby/RegressionSlopeFunctionFactory.java

Length of output: 8491


🏁 Script executed:

#!/bin/bash
# Check for multi-threaded test infrastructure in the entire test suite
echo "=== Looking for parallel/concurrent test base classes ==="
rg -n "class.*Parallel|class.*Concurrent" core/src/test/java/io/questdb/test/ --type java | head -20

echo -e "\n=== Looking for existing multi-threaded aggregate tests ==="
rg -n "supportsParallelism|parallel.*test|concurrent.*test" core/src/test/java/io/questdb/test/griffin/ --type java -A3 -B3 | head -50

Length of output: 3487


🏁 Script executed:

#!/bin/bash
# Check AbstractCairoTest for multi-threaded test support
echo "=== AbstractCairoTest methods ==="
rg -n "public.*void.*parallel|public.*void.*concurrent|public.*void.*thread" core/src/test/java/io/questdb/test/AbstractCairoTest.java

Length of output: 33


🏁 Script executed:

#!/bin/bash
# Look for other aggregate functions that support parallelism to see test patterns
echo "=== Functions that support parallelism ==="
rg -n "supportsParallelism.*true" core/src/main/java/io/questdb/griffin/engine/functions/groupby/ -l | head -10

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Check if functions are in function_list.txt
echo "=== Checking function_list.txt for new functions ==="
cat core/src/main/resources/function_list.txt | grep -E "RegressionR2|RegressionRmse|RegressionTStats"

Length of output: 52


🏁 Script executed:

#!/bin/bash
# Look for the pattern of how parallel group by tests work
echo "=== Looking at ParallelGroupByFuzzTest ==="
rg -n "class ParallelGroupByFuzzTest" core/src/test/java/io/questdb/test/cairo/fuzz/ -A100 | head -150

Length of output: 12649


🏁 Script executed:

#!/bin/bash
# Check BinaryFunction interface for thread-safety implementation
echo "=== BinaryFunction interface ==="
fd -e java "BinaryFunction.java" core/src/main/java/io/questdb/griffin/engine/functions/ --exec cat {}

Length of output: 6021


🏁 Script executed:

#!/bin/bash
# Look for tests that test merge() functionality
echo "=== Looking for tests that verify merge functionality ==="
rg -n "\.merge\(" core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ -B3 -A3 | head -50

Length of output: 57


🏁 Script executed:

#!/bin/bash
# Check for object allocations in the critical path (aggregate method)
echo "=== Checking for new object allocations in implementations ==="
rg -n "new |new\[" core/src/main/java/io/questdb/griffin/engine/functions/groupby/RegressionR2FunctionFactory.java
rg -n "new |new\[" core/src/main/java/io/questdb/griffin/engine/functions/groupby/RegressionRmseFunctionFactory.java
rg -n "new |new\[" core/src/main/java/io/questdb/griffin/engine/functions/groupby/RegressionTStatsFunctionFactory.java

Length of output: 308


🏁 Script executed:

#!/bin/bash
# Check if other regression functions have parallel tests
echo "=== Looking for tests of existing regression functions ==="
fd -e java "Regression.*Test.java" core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/

Length of output: 599


🏁 Script executed:

#!/bin/bash
# Check the function_list.txt to understand its structure
echo "=== Examining function_list.txt structure ==="
head -50 core/src/main/resources/function_list.txt

echo -e "\n=== Check if existing regression functions are listed ==="
grep -i "regression" core/src/main/resources/function_list.txt

Length of output: 2896


🏁 Script executed:

#!/bin/bash
# Look at an existing regression function test to see if they have parallel tests
echo "=== RegressionSlopeFunctionFactoryTest ==="
cat core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/RegressionSlopeFunctionFactoryTest.java

Length of output: 5108


🏁 Script executed:

#!/bin/bash
# Check if there's a specific pattern for testing aggregate functions with parallel execution
echo "=== Looking for how aggregate function tests verify parallel behavior ==="
rg -n "parallel.*group.*by|PARALLEL_GROUPBY" core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ -i -A5 | head -50

Length of output: 73


🏁 Script executed:

#!/bin/bash
# Check test results for infinity handling
echo "=== Looking at how infinity is handled in tests ==="
rg -n "POSITIVE_INFINITY|NEGATIVE_INFINITY|Infinity" core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/RegressionTStatsFunctionFactoryTest.java -A2 -B2

Length of output: 51


@bluestreak01: I've completed a comprehensive review. Here are my findings:

✅ Implementation Completeness

Strengths:

  • All three functions properly implement GroupByFunction with merge() for parallel aggregation
  • Welford's algorithm correctly applied for numerically stable single-pass computation
  • Proper null handling with Numbers.isFinite() checks
  • Edge cases handled: zero variance, perfect fits, overflow scenarios
  • Consistent pattern matching existing RegressionSlopeFunctionFactory

⚠️ Critical Issues

1. Function Registration Missing

The new functions are not registered in function_list.txt. Only RegressionInterceptFunctionFactory appears in the file:

grep -i regression core/src/main/resources/function_list.txt
# Only shows: io.questdb.griffin.engine.functions.groupby.RegressionInterceptFunctionFactory

Action Required: Add these lines to core/src/main/resources/function_list.txt:

io.questdb.griffin.engine.functions.groupby.RegressionR2FunctionFactory
io.questdb.griffin.engine.functions.groupby.RegressionRmseFunctionFactory
io.questdb.griffin.engine.functions.groupby.RegressionTStatsFunctionFactory

2. No Multi-threaded Tests

Despite supportsParallelism() returning true and merge() being implemented, there are zero tests verifying parallel execution behavior. This is a pattern across all regression functions (including existing ones like RegressionSlopeFunctionFactory).

Action Required: Add parallel execution tests similar to ParallelGroupByFuzzTest. Example test needed:

@Test
public void testRegrR2Parallel() throws Exception {
    assertMemoryLeak(() -> {
        setProperty(PropertyKey.CAIRO_SQL_PARALLEL_GROUPBY_ENABLED, "true");
        execute("create table tbl1 as (select x, 2 * x + 5 as y from long_sequence(10000))");
        
        // Compare parallel vs serial results
        String expected = "regr_r2\n1.0\n";
        assertSql(expected, "select regr_r2(y, x) from tbl1");
    });
}

3. Test Inconsistency in TStats

Line 115 in RegressionTStatsFunctionFactory.java returns POSITIVE_INFINITY or NEGATIVE_INFINITY for perfect fits:

if (ssRes == 0) {
    return slope >= 0 ? Double.POSITIVE_INFINITY : Double.NEGATIVE_INFINITY;
}

But tests expect "null" (Lines 48, 61, 73, 129 in test file). This suggests:

  • Either the code should return Double.NaN instead of infinity
  • Or tests should expect "Infinity" in output

Action Required: Clarify intent and align code with tests.

✅ Thread-Safety

Correct implementation:

  • isThreadSafe() delegates to BinaryFunction.super.isThreadSafe() which checks both argument functions
  • No shared mutable state between threads
  • aggregate() operates on MapValue (thread-local during aggregation phase)
  • merge() combines results without synchronization issues

✅ Performance & Allocations

Excellent - No issues found:

  • Zero allocations in hot path (aggregate() method) - only primitive operations
  • Single object allocation in newInstance() (acceptable, happens once per query)
  • Efficient single-pass algorithm with 6 values per group (meanY, meanX, sumXY, sumX, sumY, count)
  • No redundant computations on critical path

⚠️ Test Coverage Gaps

While you have 40 tests covering edge cases, several scenarios are missing:

  1. No merge() verification - parallel aggregation correctness untested
  2. No concurrent execution tests - race conditions not verified
  3. No large-scale data tests - only up to 1M rows with constant values
  4. No tests with actual regression data - all tests use perfect linear relationships or edge cases

Suggested additions:

@Test
public void testRegrR2WithNoise() throws Exception {
    // Test with actual regression where R² is not 1.0 or 0.0
    assertMemoryLeak(() -> {
        execute("create table tbl1 as (" +
                "select x, 2 * x + 5 + rnd_double() * 0.1 as y " +
                "from long_sequence(1000))");
        // R² should be close to but not exactly 1.0
        assertSql("regr_r2\n", "select regr_r2(y, x) from tbl1 where regr_r2(y, x) > 0.95 and regr_r2(y, x) < 1.0");
    });
}

Summary

Aspect Status Notes
Implementation ✅ Complete Welford's algorithm, proper edge cases
Registration Missing Not in function_list.txt
Thread-Safety ✅ Correct Proper delegation via BinaryFunction
Performance ✅ Excellent Zero allocations in hot path
Multi-threaded Tests Missing No parallel execution tests
Edge Case Tests ✅ Good 40 tests covering nulls, overflow, types
merge() Tests Missing Parallel aggregation untested
Test Consistency ⚠️ Issue TStats infinity vs null mismatch

Priority Actions:

  1. Add function registration to function_list.txt
  2. Add multi-threaded/parallel execution tests
  3. Fix TStats infinity handling inconsistency
  4. Add tests that verify merge() correctness
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai[bot] avatar Oct 31 '25 23:10 coderabbitai[bot]