[ENH] Column renaming
- Closes #2684
- closes #2504
Implementation for column renaming
from datatable import dt, f, by
grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]
data = {'ID': ["x%d" % r for r in range(10)],
'Gender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
'ExamYear': [2007, 2007, 2007, 2008, 2008,
2008, 2008, 2009, 2009, 2009],
'Class': ['algebra', 'stats', 'bio', 'algebra',
'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
'Participated': ['yes', 'yes', 'yes', 'yes', 'no',
'yes', 'yes', 'yes', 'yes', 'yes'],
'Passed': ['yes' if x > 50 else 'no' for x in grades],
'Employed': [True, True, True, False,
False, False, False, True, True, False],
'Grade': grades}
df = dt.Frame(data)
# proposal via this PR
df[:, dt.mean(f.Grade), by((f.ExamYear < 2009).alias('grp'))]
| grp Grade
| bool8 float64
-- + ----- -------
0 | 0 60.6667
1 | 1 70.8571
[2 rows x 2 columns]
@samukweku So what if we only introduce an f-method .as() but do not introduce a corresponding dt.as() function? Then we won't have to deal with the issues related to the Python built-in keyword as.
Otherwise, my feeling is that
dt.alias(f.C0, "newcol")
is not much better than just
{"newcol" : f.C0}
The latter even looks cleaner to me.
Btw, one of the tests is failing due to some documentation variable being missing.
@oleksiyskononenko that is so much better/cleaner and was the original idea of the PR. i just could not figure out how to write methods off the f symbol
@oleksiyskononenko how do I go about implementing f.as? or do you have plans for this that you would like to implement?
@samukweku I guess you already implemented many of f-methods here: https://github.com/h2oai/datatable/blob/main/src/core/expr/fexpr.cc However, your implementation was normally
- to import a function from the main
datatablemodule; - invoke this function.
Here you won't be able to do this, because we're not going to implement any new dt.as() function. So the way to go is to implement f.as() only as py::oobj PyFExpr::as(const XArgs&). Just like all these functions: https://github.com/h2oai/datatable/blob/main/src/core/expr/fexpr.cc#L235-L287 No need to implement an additional static py::oobj fn_as(const py::XArgs& args).
So essentially you just need to move your implementation from
static py::oobj pyfn_alias(const py::XArgs &args) {
YOUR_IMPLEMENTATION
}
to
oobj PyFExpr::alias(const XArgs& args) {
YOUR_IMPLEMENTATION
}
and make some other adjustments, so that it works.
been busy these past days on some other stuff ... I'll try and resume on this on the weekend
@oleksiyskononenko I'm not even sure what I'm doing here; I don't know how to directly set up a method on the f function. If you do not mind, kindly point me in the right direction. I could not follow along with the re.match or re.len examples
@samukweku Sure, let me rewrite what you've got here, so that you will see.
I will greatly appreciate that @oleksiyskononenko
@samukweku So I've fixed what you currently have to only support the f.as() method, but it appears as though Python won't support any method called as() and issues a SyntaxError immediately. So you're right and we need to invent some other name for this functionality. Another question is if we also want to support dt.as() or not:
- if not, then I can just push what I currently have an we just rename
f.as()to something else; - if yes, then we need to go back to your original implementation of
dt.alias()/f.alias(), but I'm not sure what was wrong with it.
@oleksiyskononenko having it as a method is better and cleaner; it will be in the same way as remove/ extend. f.alias is a good fit.
Having it as an f-method and also as dt function will not probably heart. I'm also starting to think that since we already have both dt.as_type() and f.as_type(), why don't we call a new functionality .as_name().
Not sure as_name is intuitive enough. rename like in pandas/dplyr/julia, alias like in pyspark and pypolars. I'm thinking of names that are already familiar to dataframe users
@oleksiyskononenko if we leave as is (allowing for both a function and a method), are we agreed on what the name should be? rename?alias? as_name? (not a fan of this last one, tbh) @pradkrish @vopani @Peter-Pasta any thoughts on this?
My feeling is that
-
renameis normally to rename existing columns in the frame; -
aliasis an alternative name to a column, i.e. it could be accessed with two or more different names; -
asis to select a column as something else, i.e.as_type()— as a different type,as_name()— under a different name, etc.
Since we can't use as due to Python restrictions, we should probably be more specific and indicate what exactly that "something else" is going to be.
as, as used in SQL, is to rename a column, so we are not far off if we use rename/alias
Also with this function we can rename columns and alias them. So it does not matter much IMO. alias just seems more direct - basically saying my original name is C0, but U can call me Jeremy now
@samukweku yeah, let me push updates to this PR to fix couple of things, then we could decide on the name.
So I've re-factored the function, while keeping the name dt.alias(). Made it work for both dt.alias() and FExpr.alias() cases, streamlined some code and improved tests.
Please see if everything makes sense to you and don't hesitate to ask any questions.
What we're missing here is documentation and also the final concord on the function name.
While we can leave it as alias, it may still be confusing. For instance, there is a pandas issue on the column aliasing with totally different meaning: https://github.com/pandas-dev/pandas/issues/11723
One of the options for the name I came with is name_as(). For f-methods it will look just fine, something like f[0].name_as("count") or f[:].name_as(["currency", "count"]) is pretty intuitive. However, it won't look so good if we use it through dt.name_as(f.C0, "count").
Thanks @oleksiyskononenko . I believe your name suggestion (name_as) is right. However, it should be only a method on the f symbol, and not as a dt function
@oleksiyskononenko I think we should go ahead and use the name_as but only as a method on f. If you don't mind, can you show how to do this, and I can use that as a template for future work maybe?
@samukweku That's not a big deal to have only the f-method, however, I don't think it will hurt if we keep a dt function also. For instance, as_type() is available in both forms:
- https://datatable.readthedocs.io/en/latest/api/dt/as_type.html
- https://datatable.readthedocs.io/en/latest/api/fexpr/as_type.html
The only advantage I see of keeping this as an f-method is that we could then support names as not only a list/tuple/string, but as a variable number of arguments. I.e. changing the signature from
name_as(names)
to
name_as(*names)
Then, instead of doing f[0:2].name_as(["C0", "C1"]) one could simply do f[0:2].name_as("C0", "C1").
@oleksiyskononenko the variable args is great! Totally forgot about it. I think it is a great idea for both function and method. I think name_as is ok - I'll always be biased toward alias 😁
Unfortunately, with the function dt.name_as(cols, names) it could be messy to use variable arguments. That's because both cols and names could consist of more than one element.
Does dt.also_as make sense? Or dt.aka?
I'll add the docs and make changes, while changing to name_as
@samukweku I guess you can leave it as alias in the docs. Since the as approach is called alias in SQL, it should be intuitive for the users. I will remove dt.alias() and will make FExpr.alias() to accept variable number of arguments.
@samukweku I found that I have already created documentation, just had to commit the files. Please take a look and if it's ok, we can merge this PR.
My bad @oleksiyskononenko I was having a look at it. Glad it has been merged though - it is a superb addition to the library
@samukweku oops, sorry, I saw your thumbs up and thought you already finished with the review. Anyways, please continue your review and if you see anything wrong or needs clarification, please open a separate PR with additions.