datatable icon indicating copy to clipboard operation
datatable copied to clipboard

[ENH] Column renaming

Open samukweku opened this issue 3 years ago • 5 comments

  • Closes #2684
  • closes #2504

Implementation for column renaming

from datatable import dt, f, by

grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]
data = {'ID': ["x%d" % r for r in range(10)],
             'Gender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
             'ExamYear': [2007, 2007, 2007, 2008, 2008,
                          2008, 2008, 2009, 2009, 2009],
             'Class': ['algebra', 'stats', 'bio', 'algebra',
                       'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
             'Participated': ['yes', 'yes', 'yes', 'yes', 'no',
                              'yes', 'yes', 'yes', 'yes', 'yes'],
             'Passed': ['yes' if x > 50 else 'no' for x in grades],
             'Employed': [True, True, True, False,
                          False, False, False, True, True, False],
             'Grade': grades}

df = dt.Frame(data)

# proposal via this PR
df[:, dt.mean(f.Grade), by((f.ExamYear < 2009).alias('grp'))]

   |   grp    Grade
   | bool8  float64
-- + -----  -------
 0 |     0  60.6667
 1 |     1  70.8571
[2 rows x 2 columns]

samukweku avatar Aug 10 '22 10:08 samukweku

@samukweku So what if we only introduce an f-method .as() but do not introduce a corresponding dt.as() function? Then we won't have to deal with the issues related to the Python built-in keyword as.

Otherwise, my feeling is that

dt.alias(f.C0, "newcol")

is not much better than just

{"newcol" : f.C0}

The latter even looks cleaner to me.

Btw, one of the tests is failing due to some documentation variable being missing.

oleksiyskononenko avatar Aug 11 '22 19:08 oleksiyskononenko

@oleksiyskononenko that is so much better/cleaner and was the original idea of the PR. i just could not figure out how to write methods off the f symbol

samukweku avatar Aug 11 '22 20:08 samukweku

@oleksiyskononenko how do I go about implementing f.as? or do you have plans for this that you would like to implement?

samukweku avatar Aug 12 '22 04:08 samukweku

@samukweku I guess you already implemented many of f-methods here: https://github.com/h2oai/datatable/blob/main/src/core/expr/fexpr.cc However, your implementation was normally

  • to import a function from the main datatable module;
  • invoke this function.

Here you won't be able to do this, because we're not going to implement any new dt.as() function. So the way to go is to implement f.as() only as py::oobj PyFExpr::as(const XArgs&). Just like all these functions: https://github.com/h2oai/datatable/blob/main/src/core/expr/fexpr.cc#L235-L287 No need to implement an additional static py::oobj fn_as(const py::XArgs& args).

oleksiyskononenko avatar Aug 12 '22 05:08 oleksiyskononenko

So essentially you just need to move your implementation from

static py::oobj pyfn_alias(const py::XArgs &args) {
  YOUR_IMPLEMENTATION
}

to

oobj PyFExpr::alias(const XArgs& args) {
  YOUR_IMPLEMENTATION
}

and make some other adjustments, so that it works.

oleksiyskononenko avatar Aug 14 '22 19:08 oleksiyskononenko

been busy these past days on some other stuff ... I'll try and resume on this on the weekend

samukweku avatar Aug 19 '22 01:08 samukweku

@oleksiyskononenko I'm not even sure what I'm doing here; I don't know how to directly set up a method on the f function. If you do not mind, kindly point me in the right direction. I could not follow along with the re.match or re.len examples

samukweku avatar Aug 20 '22 07:08 samukweku

@samukweku Sure, let me rewrite what you've got here, so that you will see.

oleksiyskononenko avatar Aug 20 '22 07:08 oleksiyskononenko

I will greatly appreciate that @oleksiyskononenko

samukweku avatar Aug 20 '22 07:08 samukweku

@samukweku So I've fixed what you currently have to only support the f.as() method, but it appears as though Python won't support any method called as() and issues a SyntaxError immediately. So you're right and we need to invent some other name for this functionality. Another question is if we also want to support dt.as() or not:

  • if not, then I can just push what I currently have an we just rename f.as() to something else;
  • if yes, then we need to go back to your original implementation of dt.alias() / f.alias(), but I'm not sure what was wrong with it.

oleksiyskononenko avatar Aug 30 '22 23:08 oleksiyskononenko

@oleksiyskononenko having it as a method is better and cleaner; it will be in the same way as remove/ extend. f.alias is a good fit.

samukweku avatar Aug 31 '22 03:08 samukweku

Having it as an f-method and also as dt function will not probably heart. I'm also starting to think that since we already have both dt.as_type() and f.as_type(), why don't we call a new functionality .as_name().

oleksiyskononenko avatar Aug 31 '22 19:08 oleksiyskononenko

Not sure as_name is intuitive enough. rename like in pandas/dplyr/julia, alias like in pyspark and pypolars. I'm thinking of names that are already familiar to dataframe users

samukweku avatar Aug 31 '22 20:08 samukweku

@oleksiyskononenko if we leave as is (allowing for both a function and a method), are we agreed on what the name should be? rename?alias? as_name? (not a fan of this last one, tbh) @pradkrish @vopani @Peter-Pasta any thoughts on this?

samukweku avatar Sep 02 '22 21:09 samukweku

My feeling is that

  • rename is normally to rename existing columns in the frame;
  • alias is an alternative name to a column, i.e. it could be accessed with two or more different names;
  • as is to select a column as something else, i.e. as_type() — as a different type, as_name() — under a different name, etc.

Since we can't use as due to Python restrictions, we should probably be more specific and indicate what exactly that "something else" is going to be.

oleksiyskononenko avatar Sep 08 '22 22:09 oleksiyskononenko

as, as used in SQL, is to rename a column, so we are not far off if we use rename/alias

Also with this function we can rename columns and alias them. So it does not matter much IMO. alias just seems more direct - basically saying my original name is C0, but U can call me Jeremy now

samukweku avatar Sep 11 '22 04:09 samukweku

@samukweku yeah, let me push updates to this PR to fix couple of things, then we could decide on the name.

oleksiyskononenko avatar Sep 13 '22 18:09 oleksiyskononenko

So I've re-factored the function, while keeping the name dt.alias(). Made it work for both dt.alias() and FExpr.alias() cases, streamlined some code and improved tests.

Please see if everything makes sense to you and don't hesitate to ask any questions.

What we're missing here is documentation and also the final concord on the function name.

While we can leave it as alias, it may still be confusing. For instance, there is a pandas issue on the column aliasing with totally different meaning: https://github.com/pandas-dev/pandas/issues/11723

One of the options for the name I came with is name_as(). For f-methods it will look just fine, something like f[0].name_as("count") or f[:].name_as(["currency", "count"]) is pretty intuitive. However, it won't look so good if we use it through dt.name_as(f.C0, "count").

oleksiyskononenko avatar Sep 17 '22 05:09 oleksiyskononenko

Thanks @oleksiyskononenko . I believe your name suggestion (name_as) is right. However, it should be only a method on the f symbol, and not as a dt function

samukweku avatar Sep 17 '22 07:09 samukweku

@oleksiyskononenko I think we should go ahead and use the name_as but only as a method on f. If you don't mind, can you show how to do this, and I can use that as a template for future work maybe?

samukweku avatar Sep 17 '22 09:09 samukweku

@samukweku That's not a big deal to have only the f-method, however, I don't think it will hurt if we keep a dt function also. For instance, as_type() is available in both forms:

  • https://datatable.readthedocs.io/en/latest/api/dt/as_type.html
  • https://datatable.readthedocs.io/en/latest/api/fexpr/as_type.html

oleksiyskononenko avatar Sep 17 '22 22:09 oleksiyskononenko

The only advantage I see of keeping this as an f-method is that we could then support names as not only a list/tuple/string, but as a variable number of arguments. I.e. changing the signature from

name_as(names)

to

name_as(*names)

Then, instead of doing f[0:2].name_as(["C0", "C1"]) one could simply do f[0:2].name_as("C0", "C1").

oleksiyskononenko avatar Sep 17 '22 22:09 oleksiyskononenko

@oleksiyskononenko the variable args is great! Totally forgot about it. I think it is a great idea for both function and method. I think name_as is ok - I'll always be biased toward alias 😁

samukweku avatar Sep 17 '22 22:09 samukweku

Unfortunately, with the function dt.name_as(cols, names) it could be messy to use variable arguments. That's because both cols and names could consist of more than one element.

oleksiyskononenko avatar Sep 17 '22 22:09 oleksiyskononenko

Does dt.also_as make sense? Or dt.aka?

samukweku avatar Sep 17 '22 23:09 samukweku

I'll add the docs and make changes, while changing to name_as

samukweku avatar Sep 17 '22 23:09 samukweku

@samukweku I guess you can leave it as alias in the docs. Since the as approach is called alias in SQL, it should be intuitive for the users. I will remove dt.alias() and will make FExpr.alias() to accept variable number of arguments.

oleksiyskononenko avatar Sep 18 '22 05:09 oleksiyskononenko

@samukweku I found that I have already created documentation, just had to commit the files. Please take a look and if it's ok, we can merge this PR.

oleksiyskononenko avatar Sep 19 '22 22:09 oleksiyskononenko

My bad @oleksiyskononenko I was having a look at it. Glad it has been merged though - it is a superb addition to the library

samukweku avatar Sep 20 '22 06:09 samukweku

@samukweku oops, sorry, I saw your thumbs up and thought you already finished with the review. Anyways, please continue your review and if you see anything wrong or needs clarification, please open a separate PR with additions.

oleksiyskononenko avatar Sep 20 '22 06:09 oleksiyskononenko