skrub icon indicating copy to clipboard operation
skrub copied to clipboard

[FEAT] Add MultiAggJoiner, refactor AggJoiner

Open TheooJ opened this issue 2 years ago • 4 comments

Sharing my WIP for issue https://github.com/skrub-data/skrub/issues/810

Adding the MultiAggJoiner also means refactoring the AggJoiner to match the Joiner, in particular:

  • Adding a key argument
  • Supporting polars in the test suite
  • Only accepting a single table as an input
  • Performing multiple small checks instead of one big check

Notes:

  • For now, the check concerning duplicate column names happens after the aggregation in AggJoiner. Later we / I could move it before aggregation, I didn’t want to touch it for now since it is linked to the _polars.aggregate and _pandas.aggregate functions, let me know how you feel about this
  • Should I change the landing page to show the multijoiners, or keep Joiner and AggJoiner ?

Also, thanks @jeromedockes & @Vincent-Maladiere for the discussions !

TheooJ avatar Jan 16 '24 20:01 TheooJ

I'm unsure about the default suffixes in MultiAggJoiner. If we add suffixes to the tables to be joined based on their position in the auxiliary table list, it will have a different behavior thanAggJoiner. I think AggJoiner should have the same behavior than MultiAggJoiner when given a list of 1 dataframe

WDYT ?

TheooJ avatar Jan 17 '24 12:01 TheooJ

I want to be a bit more consistent with parameter names. In the MultiAggJoiner, I think we want to keep all plurals i.e. aux_tables, keys, main_keys, aux_keys, cols, operations and suffixes.

But in the AggJoiner, there are currently aux_table, key, main_key, aux_key, cols (notice the plural), operation (no plural), suffix.

  • aux_table and suffix need to stay singular, but I think we could go either way for the others
  • To be consistent with aux_table and suffix (and to contrast with the MultiAggJoiner), we could go singular for all the others, but I think cols and operations are going to be lists in most cases and so plural makes sense for them. We could also have a mix, and just change operation for operations

Wondering how you feel about this @jeromedockes

TheooJ avatar Feb 10 '24 17:02 TheooJ

so IIUC: AggJoiner:

aux_table, key, main_key, aux_key, suffix, cols, operations

MultiAggJoiner:

aux_tables, keys, main_keys, aux_keys, suffixes, cols, operations

is that correct? sounds ok to me

jeromedockes avatar Feb 12 '24 14:02 jeromedockes

You need a changelog entry for this (adding an entry in doc/CHANGES.rst)

GaelVaroquaux avatar Feb 26 '24 09:02 GaelVaroquaux