[FEAT] Add MultiAggJoiner, refactor AggJoiner
Sharing my WIP for issue https://github.com/skrub-data/skrub/issues/810
Adding the MultiAggJoiner also means refactoring the AggJoiner to match the Joiner, in particular:
- Adding a
keyargument - Supporting polars in the test suite
- Only accepting a single table as an input
- Performing multiple small checks instead of one big check
Notes:
- For now, the check concerning duplicate column names happens after the aggregation in
AggJoiner. Later we / I could move it before aggregation, I didn’t want to touch it for now since it is linked to the_polars.aggregateand_pandas.aggregatefunctions, let me know how you feel about this - Should I change the landing page to show the multijoiners, or keep
JoinerandAggJoiner?
Also, thanks @jeromedockes & @Vincent-Maladiere for the discussions !
I'm unsure about the default suffixes in MultiAggJoiner. If we add suffixes to the tables to be joined based on their position in the auxiliary table list, it will have a different behavior thanAggJoiner. I think AggJoiner should have the same behavior than MultiAggJoiner when given a list of 1 dataframe
WDYT ?
I want to be a bit more consistent with parameter names. In the MultiAggJoiner, I think we want to keep all plurals i.e. aux_tables, keys, main_keys, aux_keys, cols, operations and suffixes.
But in the AggJoiner, there are currently aux_table, key, main_key, aux_key, cols (notice the plural), operation (no plural), suffix.
-
aux_tableandsuffixneed to stay singular, but I think we could go either way for the others - To be consistent with
aux_tableandsuffix(and to contrast with theMultiAggJoiner), we could go singular for all the others, but I thinkcolsandoperationsare going to be lists in most cases and so plural makes sense for them. We could also have a mix, and just changeoperationforoperations
Wondering how you feel about this @jeromedockes
so IIUC: AggJoiner:
aux_table, key, main_key, aux_key, suffix, cols, operations
MultiAggJoiner:
aux_tables, keys, main_keys, aux_keys, suffixes, cols, operations
is that correct? sounds ok to me
You need a changelog entry for this (adding an entry in doc/CHANGES.rst)