ENH: R poly compatibility
Hello,
I've added a function, `poly`, which attempts to reproduce R's `poly` function.
See here: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/poly.html
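To give a feel for the intended usage, here is a hypothetical example (the exact signature in the PR may differ; this just mirrors R's `poly(x, 2)`):

```python
import numpy as np
from patsy import dmatrix

data = {"x": np.linspace(0.0, 10.0, 20)}
# Hypothetical: a poly() transform usable inside a formula, producing
# degree-2 orthogonal polynomial columns like R's poly(x, 2).
design = dmatrix("poly(x, 2)", data)
print(np.asarray(design)[:3])
```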
Thanks.
Coverage increased (+0.007%) to 98.667% when pulling d290dd30163170950315e0e3e398b51234757585 on thequackdaddy:poly_qr into 8b6c7121e7ec3f95da1fd9b9760945835abff8b3 on pydata:master.
Codecov Report

Merging #92 into master will increase coverage by 0.02%. The diff coverage is 100%.

Coverage Diff (master → #92):

- Coverage: 98.96% → 98.98% (+0.02%)
- Files: 30 → 32 (+2)
- Lines: 5585 → 5703 (+118)
- Branches: 775 → 791 (+16)
- Hits: 5527 → 5645 (+118)
- Misses: 35 → 35 (no change)
- Partials: 23 → 23 (no change)

| Impacted Files | Coverage Δ |
|---|---|
| patsy/__init__.py | 96.66% <100%> (+0.11%) :arrow_up: |
| patsy/polynomials.py | 100% <100%> (ø) |
| patsy/contrasts.py | 100% <100%> (ø) :arrow_up: |
| patsy/test_poly_data.py | 100% <100%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 4c613d0...4d107c6.
I realized after writing this that it may relate to #20. But #20 raises some good questions:

- The code could be simplified to use the different `XXXvander` methods that numpy provides.
- Josef brings up the point that `qr` might be an unnecessary complication. I think simply scaling the values might get you to the same end. (A sketch of the QR approach follows this list.)
- My real need is for a method that provides some non-linear relationship between exogenous and endogenous data, with the ability to extrapolate to values not in the training data. So changing the scope of this PR to still accommodate that might be a good initial step.
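For anyone following along, here is a minimal sketch of the QR approach under discussion (my own illustration, not the PR's actual code): build a `polyvander` basis on centered data and orthonormalize it with `np.linalg.qr`, which is essentially what R's `poly` does.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

def r_style_poly(x, degree=2):
    """Orthonormal polynomial basis, roughly like R's poly(x, degree)."""
    x = np.asarray(x, dtype=float)
    # Vandermonde basis of the centered data: columns 1, x, x**2, ...
    basis = npoly.polyvander(x - x.mean(), degree)
    q, r = np.linalg.qr(basis)
    # Fix column signs for a deterministic result, then drop the
    # constant column (R and patsy handle the intercept separately).
    q = q * np.sign(np.diag(r))
    return q[:, 1:]

print(r_style_poly(np.arange(6.0), degree=2))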
Thanks.
Is it possible for this to share the core implementation with patsy.contrasts.Poly? That was cribbed from R as well, and in principle at least they should be doing essentially the same thing...
> Is it possible for this to share the core implementation with patsy.contrasts.Poly?
I would think so. Honestly, I wrote this before I noticed the issue mentioned above had already been filed. Maybe I can incorporate those thoughts into this too.
@njsmith 3 changes...

- `patsy.contrasts.Poly` uses the same code as `patsy.polynomials` to generate the `ContrastMatrix`.
- Per some notes in the PR, I made it so that you can use any of the `XXXvander` methods from numpy (illustrated after this list).
- In addition to QR-orthogonalization to "scale" the data, you can just use a straight standardizer.
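For context, the `XXXvander` family lives under `numpy.polynomial`; each function evaluates a different polynomial basis at the same sample points. A small illustration using plain numpy (not this PR's API):

```python
import numpy as np
from numpy.polynomial import chebyshev, laguerre, legendre, polynomial

x = np.linspace(-1.0, 1.0, 5)
degree = 3

# Each *vander function returns an (n, degree + 1) matrix whose
# columns evaluate that family's basis polynomials at x.
bases = {
    "polyvander": polynomial.polyvander(x, degree),  # 1, x, x**2, ...
    "chebvander": chebyshev.chebvander(x, degree),   # Chebyshev T_k
    "legvander": legendre.legvander(x, degree),      # Legendre P_k
    "lagvander": laguerre.lagvander(x, degree),      # Laguerre L_k
}
for name, mat in bases.items():
    print(name, mat.shape)
```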
Also dumb question...
Does patsy require pandas now? I noticed that my codecov score is not the best; the reason appears to be that I'm skipping the `have_pandas == False` scenario. Not sure if that's even something we should be considering? Why write exception handling if the exception path isn't supported? (The pattern in question is sketched below.)
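For anyone following along, the optional-dependency pattern being discussed looks roughly like this (patsy's actual helper lives in `patsy.util`; this is just the general shape):

```python
# Sketch of the optional-pandas pattern: keep a module-level flag
# so downstream code can branch on pandas availability.
try:
    import pandas
    have_pandas = True
except ImportError:
    have_pandas = False

def maybe_to_dataframe(arr, columns):
    """Return a DataFrame when pandas is installed, else the raw array."""
    if have_pandas:
        return pandas.DataFrame(arr, columns=columns)
    return arr
```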
Hmm, that's a ton of options there, between raw versus not, different polynomial bases, standardizing... I guess my first question is, are any of these... useful? Besides the boring basic R-compatible orthonormalized polynomials? Is there some compelling argument for using them in some particular situation, or an existing audience who expects them?
Regarding pandas: nominally at least we still don't depend on pandas. You're right that there ought to be tests for this, though, oops. Possibly this doesn't make sense anymore; it's a different world now than it was back in 2012 or whatever when this decision was made... OTOH I dunno if people using patsy with scikit-learn for example necessarily use/want pandas.
I've just tried adding a no-pandas test to the travis matrix -- I guess in ~10 minutes we'll know if (a) I'm any good at convincing travis to do what I want, (b) if so, whether the no-pandas branches actually work!
master branch is now testing the no-pandas configuration too, so the next time you push it then codecov should start calculating your stats more accurately :-)
Yeah I may have gone overboard there... :-/
I think QR vs. raw is necessary. Standardizing is kind of dumb... Not sure why you'd ever use it. AFAIU, the point of scaling in this way is just so that (1) the columns of data in the design matrix are relatively orthogonal, and (2) the columns aren't of such wildly different scale that `np.linalg.matrix_rank` lacks the precision to rank the matrix correctly. (A quick illustration of that failure mode follows.)
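A quick illustration of point 2, with made-up numbers just to show the failure mode: raw polynomial columns of wildly different scale can defeat `np.linalg.matrix_rank`, while orthonormalized columns of the same space do not.

```python
import numpy as np

x = np.linspace(0.0, 1000.0, 50)
# Raw polynomial columns: scales span roughly 27 orders of magnitude.
raw = np.column_stack([x ** k for k in range(10)])
print(np.linalg.matrix_rank(raw))  # typically well below the true rank of 10

# Same column space after QR orthonormalization.
q, _ = np.linalg.qr(raw)
print(np.linalg.matrix_rank(q))    # 10
```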
I don't really know the reasons for the differences between poly vs. Chebyshev vs. Legendre vs. Laguerre. Josef's post mentioned that we should point to a generic `XXXvander`, so I just assumed we'd let the user pick. I may have misinterpreted some of what Josef was trying to say.
Do you think I should drop standardizing and all the vanders other than `polyvander`? I doubt I would use anything but that. Plus, my tests are against R, which is just a QR-orthogonalized `polyvander`, so for the rest of this we are just assuming...
cc @josef-pkt - I know you're probably busy, but I'd appreciate your thoughts on whether there's any value in all this. Happy to remove this complexity if it's not worthwhile.
I've simplified this back down so that it mimics R's poly function, and removed all the other vanders. When `raw=False` (the default), it uses QR to orthogonalize a `polyvander` basis.
@njsmith - It's been over a year, but I think this is close to being good enough. Any comments? Thanks!
Any update on this PR?
Is this merged yet?
Hi @thequackdaddy ! I took a quick look through, and this looks good to me. Let me know if this is still good to merge as is. If so, I'll merge it in :).
Support for poly has now been merged into formulaic.
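For reference, usage in formulaic looks something like the following (assuming formulaic's `model_matrix` entry point and its R-style `poly` transform):

```python
import pandas as pd
from formulaic import model_matrix

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
# Orthogonal polynomial columns, analogous to R's poly(x, 2).
mm = model_matrix("poly(x, 2)", df)
print(mm)
```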