Forward-Filling / Back-Filling Timeseries is Not Possible

Open jmsmdy opened this issue 2 years ago • 0 comments

What's up?

It appears there is no way to do this in PRQL right now (at least not cleanly):

SELECT 
      location,
      day,
      LAST_VALUE(temperature) IGNORE NULLS OVER (
          PARTITION BY location
          ORDER BY day
          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
     ) AS temperature,
     LAST_VALUE(pressure) IGNORE NULLS OVER  (
          PARTITION BY location
          ORDER BY day
          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
     ) AS pressure
FROM weather_data

Some attempts:

from weather_data
group {location} (
    sort day
    window rows:..0 (
        filter pressure != null
        derive {pres = last pressure}
    )
    window rows:..0 (
        filter temperature != null
        derive {temp = last temperature}
    )
)

Semantically, you would expect a filter in window (filter ... | derive ...) to filter the records in that window used for the later derivation, but instead the filter is applied to the entire table, which is not the right thing.

It doesn't appear possible to move the filter inside the derive clause (like derive {temp = temperature | x -> filter x !=null | last}, because filter is typed as a function relation -> relation. I'm guessing this probably won't end up being added to PRQL (absent a huge amount of work), because while DuckDB supports filter clauses in window functions, most don't.

Another attempt:

let last_non_null = col <array> -> <scalar> s"LAST({col}) IGNORE NULLS"

from weather_data
group {location} (
    sort day
    window range:..0 (
        derive {
            pres = last_non_null pressure,
            temp = last_non_null temperature
        }
    )
)

This fails (no matter how you type hint) because prqlc fails to recognize last_non_null as a window function. In any case, even if this worked, you now need to change the let ... statement depending on what SQL dialect you are compiling to.

The only way I have gotten this to work is to do the windowing (partitioning + sorting + rows-between + aggregation) entirely in the target SQL dialect using a big s-string, which is not very clean. The limitations of s-strings make it impossible to fully parameterize the windowing process (e.g. there is no way to make s-string valued function that expands tuples inside the s-string, in case you want your SQL-valued function to partition by a variable number of columns).

Mar 12 '24 14:03 jmsmdy