patsy icon indicating copy to clipboard operation
patsy copied to clipboard

patsy pickles

Open thequackdaddy opened this issue 8 years ago • 18 comments

Hello,

This is a continuation effort of #86 and #67. My apologies if this is viewed as stepping on someone else's toes... but this is an interesting--and useful topic--for me.

I've basically got this working (with some simple tests on real data), but some things break some of the unit tests.

One problem is that as far as I can tell, doing a pickle.loads(pickle.dumps(obj)) will yield a different object, assigned to different memory than the original obj. Thus, the typical round-trip test obj == pickle.loads(pickle.dumps(obj)) fails, unless you just check that the obj.__dict__ matches. However, that breaks some things like... this line

I've also run across some issues with the stateful transforms like center, standardize, etc. These are functions (and from my rudimentary knowledge of computer science, pickling functions is generally a no-no). I need to set the __qualname__ and __name__ of these functions because when you generally do...

> center = stateful_transform(Center)
> center.__qualname__
'Center'

But when you pickle.dump, it realizes that Center refers to the class, and center is a function. So I have to hard-change the __qualname__ to get this to work. See here. This seems undesireable.

Anyways, let me know if there's any interest in this work. I'm by no means an expert at this... and the test cases could obviously be improved. But I'm not a programmer by trade, so please grant me some forbearance.

Thanks!

thequackdaddy avatar Apr 06 '17 00:04 thequackdaddy

Oh... and if you care to see the test-suite results... I ran it here... https://travis-ci.org/thequackdaddy/patsy/jobs/219101866

Some small things fail here, but I think it works overall. Namely there is a requirement that 2 generatedEvalEnvironments don't equal each other, but that contradicts the idea that a pickle.loads(pickle.dumps(obj)) == obj

Sorry, no coverage available.

thequackdaddy avatar Apr 06 '17 00:04 thequackdaddy

General comments:

  • It's very annoying that pickle doesn't accept module objects :-(. The code for working around that seems basically reasonable, though maybe could be a bit cleaner.

  • I don't understand why stateful transform functions like center need to be pickleable? the instantiated transform objects certainly do (and they need real __getstate__ methods :-P), but a fitted model shouldn't be referring to the functions themselves anymore, I would think... maybe we need to refactor a bit to make this true.

  • I'm not a big fan of all the __eq__ changes – this is a globally visible change, and the definitions are just targeted to the idiosyncratic needs of our tests :-(. (And btw, they're all buggy, because in python you have to override __ne__ too... I forget the exact rules, this might be fixed in py3, but definitely in py2.) Maybe a better approach would be to write end-to-end tests, that run some data through an unpickled model and checks that the result matches?

njsmith avatar Apr 06 '17 01:04 njsmith

But overall: this isn't ready yet, but it's very exciting to see someone working on it :-)

njsmith avatar Apr 06 '17 01:04 njsmith

Thanks. I'll try and take a look at this tomorrow.

I think you caught a few of the shortcuts I was trying... I'll clean up as appropriate.

As the the __eq__ issue... i was also thinking I could define my own assert statement that would check that the __dict__ and all nested __dict__ were equal. Maybe call it assert_pickle_equal. Some items, like EvalEnvironment, I don't see how to pass data through.

Thanks for your excellent comments. Hopefully I'll have an update tomorrow afternoon-ish.

thequackdaddy avatar Apr 06 '17 02:04 thequackdaddy

Update for today:

https://travis-ci.org/thequackdaddy/patsy/builds/219481856 (all green)

I've gone through/reverted your code as you recommended. Thanks for your pointers.

I haven't figured out why stateful transform function like center are in the EvalEnvironment. I quickly tried to remove it, but got an error because patsy later looked for it. (Not sure if it actually uses it... need to follow the code some more.)

But this seems do-able.

thequackdaddy avatar Apr 06 '17 22:04 thequackdaddy

This has been very educational. Thanks for the help.

I took a stab at fixing the issue where the patsy stateful transforms required __qualname__ to be speicified. I think that once the design_info has been built, it should be safe to just remove the functions.... So I added a clean method on EvalEnvironment that removes them.

Here's a gist showing that I think this works.

thequackdaddy avatar Apr 07 '17 01:04 thequackdaddy

That looks like it will work, but I think it should be possible to solve in a more fundamental way. Looking at memorize_passes_needed in EvalFactor I think I see the problem: there's some code that pulls out the set of names referenced in the code (using the ast_names helper function), and then reduces the EvalEnvironment to just those names (eval_env.subset). This was added in #66 as preparation for the pickling support -- it's why trying to pickle an EvalEnvironment doesn't try to save your entire Python environment :-).

And then if you look a few lines after we compute the set of variables to keep, then there's some code that rewrites the code to convert stateful transform calls like center(x) into method calls on a newly allocate transform object.

So I think the best solution is to swap the order of these things: if we make the list of variables to keep after eliminating references to stateful transforms, then it should just work.

I know this is getting into somewhat deep wizardry... I'm going to try to look at this more closely within the next few days, and I can fix this part up if you don't get there first :-)

njsmith avatar Apr 07 '17 05:04 njsmith

You are correct... simply swapping out the order of when the subset is done (and doing the subset on the replaced funcall code) seems like it does the trick.

There are 2 tests that would fail because of this change because they look for the stateful transforms in the eval_env.

here and here

So I've changed the tests to do the opposite... assert that the keys aren't in the namespaces.

And here's travis

thequackdaddy avatar Apr 08 '17 06:04 thequackdaddy

@njsmith Wanted to check in and see if you had any other thoughts on this. Seems like it works for me but happy to work it more if you think this is something that is merge-able

thequackdaddy avatar Apr 18 '17 01:04 thequackdaddy

So I built this on a windows server 2016 machine and I got a couple of nosetest failures:

======================================================================
ERROR: patsy.test_pickling.test_pickling_old_versions_still_work('C:\\Users\\pquackenbush\\git\\patsy\\patsy\\..\\pickle_testcases\\0.5\\transform_center.pickle',)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\bleeding\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\Users\pquackenbush\git\patsy\patsy\test_pickling.py", line 137, in check_pickling_old_versions_still_work
    pickle.load(f))
TypeError: data type "f16" not understood

======================================================================
ERROR: patsy.test_pickling.test_pickling_old_versions_still_work('C:\\Users\\pquackenbush\\git\\patsy\\patsy\\..\\pickle_testcases\\0.5\\transform_standardize_norescale.pickle',)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\bleeding\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\Users\pquackenbush\git\patsy\patsy\test_pickling.py", line 137, in check_pickling_old_versions_still_work
    pickle.load(f))
TypeError: data type "f16" not understood

======================================================================
ERROR: patsy.test_pickling.test_pickling_old_versions_still_work('C:\\Users\\pquackenbush\\git\\patsy\\patsy\\..\\pickle_testcases\\0.5\\transform_standardize_rescale.pickle',)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\bleeding\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\Users\pquackenbush\git\patsy\patsy\test_pickling.py", line 137, in check_pickling_old_versions_still_work
    pickle.load(f))
TypeError: data type "f16" not understood

----------------------------------------------------------------------
Ran 50 tests in 0.281s

Any hints before I start digging into this?

I wonder if this is a hint I should use appveyor too...

thequackdaddy avatar Apr 18 '17 17:04 thequackdaddy

@njsmith Ok I dug into this a bit...

I think the issue is that 'f16' is the dtype for np.float128. The computer I ran this on--at the time--had np.float128 available, but my Windows Server 2016 computer does not have that dtype available. (I'm a little confused because I can't seem to figure out what I did that got the float128 datatype to show up in the first place... Both are 64 bit systems... one is a laptop, one is a xeon e5 virtual machine).

patsy tries to promote the data to a "wider" dtype here. So if you make the pickle on a computer with np.float128, then try to load it on a computer without that available, I think this will happen.

Options:

  • Maybe only promote up to np.float64.
  • Have an error handle that says when you load a datatype that doesn't exist on the current environment, to gently fail.
  • re-write tests to not have this issue

Thoughts?

thequackdaddy avatar Apr 26 '17 17:04 thequackdaddy

@njsmith Pinging to see if there is any interest in reviewing/merging this.

Thanks!

thequackdaddy avatar Jun 06 '17 17:06 thequackdaddy

@njsmith Hello again. Just pinging to see if there's interest in this... would be a little helpful for me to cut down on some model sizes.

thequackdaddy avatar Dec 21 '17 02:12 thequackdaddy

Codecov Report

Merging #104 into master will decrease coverage by 0.65%. The diff coverage is 89.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #104      +/-   ##
==========================================
- Coverage   99.08%   98.43%   -0.66%     
==========================================
  Files          30       31       +1     
  Lines        5557     5928     +371     
  Branches      773      822      +49     
==========================================
+ Hits         5506     5835     +329     
- Misses         28       63      +35     
- Partials       23       30       +7
Impacted Files Coverage Δ
patsy/mgcv_cubic_splines.py 99.14% <100%> (+0.03%) :arrow_up:
patsy/splines.py 100% <100%> (ø) :arrow_up:
patsy/origin.py 96.72% <100%> (+0.29%) :arrow_up:
patsy/design_info.py 99.64% <100%> (+0.01%) :arrow_up:
patsy/constraint.py 100% <100%> (ø) :arrow_up:
patsy/state.py 100% <100%> (ø) :arrow_up:
patsy/desc.py 98.46% <100%> (+0.11%) :arrow_up:
patsy/missing.py 99.09% <100%> (+0.03%) :arrow_up:
patsy/contrasts.py 100% <100%> (ø) :arrow_up:
patsy/test_highlevel.py 99.36% <100%> (+0.02%) :arrow_up:
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 3b7b9c2...7129507. Read the comment docs.

codecov-io avatar Feb 27 '18 16:02 codecov-io

@njsmith - Checking in again to see if this is on the right track/something that could be merged. Let me know either way. Thanks so much!

thequackdaddy avatar Feb 28 '18 14:02 thequackdaddy

@thequackdaddy @njsmith how is this PR looking?

dswah avatar Apr 08 '18 06:04 dswah

@dswah I'm waiting for review/feedback... This is definitely a big project, and part of it--such as the whole assert_pickled_equals is difficult to check.

If you want to read/through, provide feedback, I think that might be helpful.

thequackdaddy avatar Apr 08 '18 12:04 thequackdaddy

Hi @thequackdaddy ,

Thanks for these patches. As one of the new maintainers I'm going through the PRs and figuring out what still needs attention. Even though your patches have been sitting here for along time, the code-base hasn't change much and so it's likely they are still suitable. Is this something you are still interested in hashing out?

Given that the name of the game from a maintenance perspective is to keep things working as is, I'm a little cautious to introduce massive changes that might introduce new bugs. It should also be noted that https://github.com/matthewwardrop/formulaic (the spiritual successor of this project) already has support for pickling, as well as sparse representations and is much more performant. But if you'd like to explore finishing up this work, I'm willing to dig into it with you.

Best, M

matthewwardrop avatar Sep 07 '21 17:09 matthewwardrop