bigquery-utils icon indicating copy to clipboard operation
bigquery-utils copied to clipboard

Missing values are not taken into account in UDF "median".

Open ryota-tanimoto opened this issue 4 years ago • 0 comments

Since this UDF does not take missing values into account in the calculation, I believe it needs to be included in "README.md" or the function needs to be modified.

Currently, when an array containing missing values is given, the length of the array is calculated as is. Therefore, in the following cases, unintended results may be returned. I think we need to add a function that arbitrarily removes or completes the missing values.

Current source code

SELECT IF (
  MOD(ARRAY_LENGTH(arr), 2) = 0,
  (arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
  arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
)
FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)

Input

DECLARE arr ARRAY<INT64>;
SET arr = [1, 2, 3, 4, 5, 6, NULL];

Current result

3.0

The real result (Remove missing values)

3.5

So I would like to propose a process using the existing function "PERCENTILE_CONT" in the following form.

SELECT
  PERCENTILE_CONT(val, 0.5) OVER()
FROM UNNEST(arr) AS val 
LIMIT 1

ryota-tanimoto avatar Jan 31 '22 04:01 ryota-tanimoto