patsy icon indicating copy to clipboard operation
patsy copied to clipboard

ENH: R poly compatibility

Open thequackdaddy opened this issue 9 years ago • 16 comments

Hello,

I've added the function poly which attempts to reproduce R's poly function.

See here: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/poly.html

Thanks.

thequackdaddy avatar Sep 15 '16 01:09 thequackdaddy

Coverage Status

Coverage increased (+0.007%) to 98.667% when pulling d290dd30163170950315e0e3e398b51234757585 on thequackdaddy:poly_qr into 8b6c7121e7ec3f95da1fd9b9760945835abff8b3 on pydata:master.

coveralls avatar Sep 15 '16 01:09 coveralls

Codecov Report

Merging #92 into master will increase coverage by 0.02%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #92      +/-   ##
==========================================
+ Coverage   98.96%   98.98%   +0.02%     
==========================================
  Files          30       32       +2     
  Lines        5585     5703     +118     
  Branches      775      791      +16     
==========================================
+ Hits         5527     5645     +118     
  Misses         35       35              
  Partials       23       23
Impacted Files Coverage Δ
patsy/__init__.py 96.66% <100%> (+0.11%) :arrow_up:
patsy/polynomials.py 100% <100%> (ø)
patsy/contrasts.py 100% <100%> (ø) :arrow_up:
patsy/test_poly_data.py 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 4c613d0...4d107c6. Read the comment docs.

codecov-io avatar Sep 15 '16 01:09 codecov-io

I realized after writing that this may relate to #20 .

But 20 points out some good questions...

  • The code could be simplified to use the different vander methods that numpy provides.
  • Josef brings up a point that qr might be an unnecessary complication. I think simply scaling the values might get you to the same end.
  • My real need is for a method to provide some non-linear relationships between exogenous and endogenous data with the ability to extrapolate values not in the training data. So changing the scope of this PR to still accommodate for that might be a good initial step.

Thanks.

thequackdaddy avatar Sep 16 '16 23:09 thequackdaddy

Is it possible for this to share the core implementation with patsy.contrasts.Poly? That was cribbed from R as well, and in principle at least they should be doing essentially the same thing...

njsmith avatar Oct 25 '16 20:10 njsmith

Is it possible for this to share the core implementation with patsy.contrasts.Poly?

I would think so. Honestly I wrote this before I noticed the above noted, Issue that was already filed. Maybe I can incorporate those thoughts into this too.

thequackdaddy avatar Oct 25 '16 21:10 thequackdaddy

@njsmith 3 changes...

  1. patsy.contrast.Poly uses the same code as patsy.polynomials to generate the ContrastMatrix.
  2. Per some notes in the PR, I made it so that you can use any of the XXXvander methods from numpy.
  3. In addition to QR-orthogonalization to "scale" the data, you can just use a straight standardizer

thequackdaddy avatar Oct 26 '16 03:10 thequackdaddy

Also dumb question...

Does patsy require pandas now? I noticed that my codecov score is not the best, the reason appears to be that I'm skipping the have_pandas == False scenario. Not sure if that's even something we should be considering? Why write exception handling if the exception isn't supported?

thequackdaddy avatar Oct 26 '16 04:10 thequackdaddy

Hmm, that's a ton of options there, between raw versus not, different polynomial bases, standardizing... I guess my first question is, are any of these... useful? Besides the boring basic R-compatible orthonormalized polynomials? Is there some compelling argument for using them in some particular situation, or an existing audience who expects them?

Regarding pandas: nominally at least we still don't depend on pandas. You're right that there ought to be tests for this, though, oops. Possibly this doesn't make sense anymore; it's a different world now than it was back in 2012 or whatever when this decision was made... OTOH I dunno if people using patsy with scikit-learn for example necessarily use/want pandas.

I've just tried adding a no-pandas test to the travis matrix -- I guess in ~10 minutes we'll know if (a) I'm any good at convincing travis to do what I want, (b) if so, whether the no-pandas branches actually work!

njsmith avatar Oct 26 '16 05:10 njsmith

master branch is now testing the no-pandas configuration too, so the next time you push it then codecov should start calculating your stats more accurately :-)

njsmith avatar Oct 26 '16 05:10 njsmith

Yeah I may have gone overboard there... :-/

I think QR vs. raw is necessary. Standardizing is kind of dumb... Not sure why you'd use it ever. AFAIU, the point of scaling in this way is just so that 1. the columns of data in the design matrix are relatively orthogonal and 2. when columns are of wildly different scale, the np.linalg.matrix_rank function won't have sufficient precision to correctly rank the matrix.

I don't really know the reasons for the differences in poly vs. chebyshev vs legende vs. laguerre. Josef poster mentioned that we should point to a generic XXXvander, I just assumed we let the user pick. I may have misinterpreted some of what Josef was trying to say.

You think I should drop all the other vanders other than polyvander and standardizing? I doubt I would use anything but that. Plus, my tests are vs. R, which is just QR-orthogonalized polyvander so the rest of this we are just assuming...

cc @josef-pkt - I know you're probably busy, but I'd appreciate your thoughts if there's any value in all this. Happy to remove this complexity if its not worthwhile.

thequackdaddy avatar Oct 26 '16 06:10 thequackdaddy

I've simplified this back down so that it mimics R's poly function. Removed all the other vanders. When raw=False (the default) it will use QR-to orthogonalize a polyvander.

thequackdaddy avatar Nov 04 '16 01:11 thequackdaddy

@njsmith - Its been over a year, but I think this is close to being good enough. Any comments? Thanks!

thequackdaddy avatar Feb 28 '18 14:02 thequackdaddy

Any update on this PR?

has2k1 avatar Sep 10 '19 14:09 has2k1

Is this merged yet?

jonathan-taylor avatar Sep 26 '19 16:09 jonathan-taylor

Hi @thequackdaddy ! I took a quick look through, and this looks good to me. Let me know if this is still good to merge as is. If so, I'll merge it in :).

matthewwardrop avatar Sep 07 '21 18:09 matthewwardrop

Support for poly has now been merged into formulaic.

matthewwardrop avatar Oct 16 '21 05:10 matthewwardrop