Make automatic centering in PCA methods optional
- Closes: https://github.com/dask/dask-ml/issues/734.
- Tests/
black/pyflakespassing: Yes. - Tests added to the following files:**
-
tests/test_pca.py -
tests/test_incremental_pca.py
-
- tl;dr of design choices made:
-
IncrementalPCAdoesn't get affected by the new parameter andIncrementalPCA(..., center=False ,...)is unsupported. -
whiten=Truedoes not imply, nor enforcecenter=True. Instead it warns about possible unexpected behavior, if the underlying data hasn't been centered already.
-
Once the maintainers confirm approval of the proposed changes, I'll be happy to update the .rst docs accordingly, to document the updated API, in a separate commit (as part of this PR).
From the latest CI runs (e.g., this job).
black, version 19.10b0
would reformat /home/vsts/work/1/s/tests/test_pca.py
Oh no! 💥 💔 💥
1 file would be reformatted, 99 files would be left unchanged.
Oops - pretty sure that I did this check locally. Will re-run black to confirm.
Update: This is strange. I'll check if my local branch has actually propagated, and is the same as what the Azure pipeline has picked up.
Result of ./ci/code_checks.sh run locally from the root dir of `dask-ml`:
Checking flake8...
Checking flake8... DONE
Checking black...
black, version 20.8b1
would reformat /git/dask-ml/dask_ml/model_selection/utils_test.py
would reformat /git/dask-ml/tests/ensemble/test_blockwise.py
would reformat /git/dask-ml/tests/model_selection/test_keras.py
would reformat /git/dask-ml/tests/model_selection/test_hyperband.py
would reformat /git/dask-ml/tests/model_selection/dask_searchcv/test_model_selection.py
would reformat /git/dask-ml/tests/model_selection/test_incremental.py
Oh no! 💥 💔 💥
6 files would be reformatted, 94 files would be left unchanged.
Checking black... DONE
Checking isort...
5.6.4
ERROR: /git/dask-ml/tests/test_spectral_clustering.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/tests/test_incremental_pca.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/tests/linear_model/test_glm.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/tests/model_selection/dask_searchcv/test_model_selection_sklearn.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/docs/dimensions.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/dask_ml/_compat.py Imports are incorrectly sorted and/or formatted.
...
Update (2): Seems that it's the same code version, both locally and remotely.
Locally:
$ git show
commit 01dd9d4f71ee380e5b3e1a969f3bd46bebda6cd9 (HEAD -> 734-pca-skip-centering, origin/734-pca-skip-centering)
...
Add param validation to PCA classes (#808)
Remotely: The most recent commit (and only one since Tom's review): https://github.com/dask/dask-ml/pull/808/commits/01dd9d4f71ee380e5b3e1a969f3bd46bebda6cd9.
I'll try reproducing black's results via the same version as per the CI pipeline (black, version 19.10b0).
Update (3): Victory! It was indeed due to the different local version of black==black-20.8b1 vs remote (black==19.10b0, as per .pre-commit-config.yaml and ci/*.yaml).
Hi @TomAugspurger - I'm writing to check if you've had a chance to look into the further changes that I pushed since your review, and whether you'd like any additional refinements to be done on top of that. Thanks!
It'll be a week or so before I can take a look. Thanks for your patience!
In the meantime, perhaps @jameslamb has a chance to glance through this (no worries if you don't though)?
It'll be a week or so before I can take a look. Thanks for your patience!
It's all good - just wanted to check if this is still on your radar. I'm happy to wait as long as necessary, and incorporate potential further feedback. Thanks for your confirmation, @TomAugspurger!
I had some time to go through and review today. Overall, I'm in favor of the change and I can understand how it could be a nice optimization for experienced users.
Please consider some of the review comments I left.
I also think it would improve understanding for people who find this pull request from search or see it in a changelog if you change the PR title to something like "Make automatic centering in PCA methods optional". By itself, the fact that there is now a parameter called
centerdoesn't tell you much about the functional benefit of the changes in this pull request.
Thanks for your review, @jameslamb! I'll try to address your comments and make changes, wherever applicable, ASAP.