Python and C++ code refactoring
- Python code is now formatted with `black` and `isort`, and has been refactored according to the PEP 8 style guide.
- Python code is partially commented in the Google docstring format.
- C++ code is partially commented in the Doxygen docstring format.
- Scott's and the Normal Reference Rule's bandwidth calculations have been reordered and commented.
- The `ArcOperatorSet::update_incoming_arcs_scores` formulas have been reordered and commented.
Thank you for the PR @carloslihu. I am reviewing it. I don't understand the bug in `hillclimbing.hpp` (lines 152-160). Can you provide more detail?
I see some commented-out tests in tests/learning/algorithms/hillclimbing_test.py (I think they are related), but I do not know where `generate_normal_data_dep(1000)` (line 9) is defined.
Hello David, thank you for reviewing my changes; there are a lot of them:
- The bug I'm talking about appears when performing HC with cross-validation on a variable with low variability (but not zero variance), for instance a variable A with training values [0, 0, ..., 0, 1]. When performing the validated k-fold cross-validation, fitting the distribution fails whenever the "1" value is not present in the training fold. It is not a bug as such, more a problem I found with some datasets, and I do not have a clear solution for the algorithm. You may ignore the comment; I will remove it. (A minimal sketch of the failure mode follows after this list.)
- I still have pending the check that the Python tests still work, as well as cleaning up the comments. It's a minor change, but I plan to finish it this month. Please wait for it, thanks.
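A minimal sketch of the failure mode described in the first item, assuming a plain k-fold index split (the data and fold logic here are hypothetical, not PyBNesian code):

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.zeros(1000)
a[rng.integers(1000)] = 1.0  # low variability, but variance is not 0

# Split the indices into 10 folds. The one training split that excludes
# the single "1" has zero variance, so a Gaussian fit degenerates there.
for k, test_idx in enumerate(np.array_split(rng.permutation(1000), 10)):
    train = np.delete(a, test_idx)
    if train.var() == 0.0:
        print(f"fold {k}: training split has zero variance -> singular fit")
```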
Another issue is that the tests remain backwards compatible up to tag v0.5.2. Afterwards, however, I started using the diagonal covariance as the default bandwidth matrix for KDE, instead of the full covariance (as suggested here).
This makes KDE learning faster and removes the singular covariance matrix error. Although it may return less accurate bandwidth matrices, in my experience the differences are minimal.
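For illustration, a minimal numpy sketch of the idea, assuming Scott's rule H = n^(-2/(d+4)) * Sigma (the function below is hypothetical, not the PyBNesian implementation):

```python
import numpy as np

def scotts_bandwidth(data, diagonal=False):
    """Scott's rule bandwidth matrix H = n**(-2/(d+4)) * Sigma."""
    n, d = data.shape
    sigma = np.cov(data, rowvar=False)
    if diagonal:
        # Keep only the variances; the result is invertible whenever every
        # variable has nonzero variance, avoiding a singular covariance.
        sigma = np.diag(np.diag(sigma))
    return n ** (-2.0 / (d + 4)) * sigma
```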
If you like this change, I could adapt the tests so that they check the results given by this new method.
Thank you @carloslihu. I will try to create a test with a case of almost zero variance to try to control it. First, I want to know where the error is being raised.
About the diagonal matrices: yes, many authors recommend using diagonal matrices, since there are fewer parameters to estimate and usually only a small degradation in performance. That is why I defined ProductKDE (https://pybnesian.readthedocs.io/en/latest/api/factors.html#pybnesian.ProductKDE).
I know that ProductKDE is not exactly the same as a KDE with a diagonal bandwidth matrix (a sketch of one possible distinction follows below), but:
- Do you remember whether the methods to estimate the bandwidth are the same in ProductKDE and diagonal KDE?
- If not, do you think we need a specific implementation for bandwidth estimation in diagonal KDEs?
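A hedged numpy sketch of one possible distinction, under the assumption that ProductKDE estimates the joint density as a product of marginal univariate KDEs (this reading is an assumption, not confirmed from the PyBNesian source):

```python
import numpy as np

def gauss(u, h):
    """Univariate Gaussian kernel with bandwidth h, evaluated at u."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))

data = np.array([[0.0, 0.0], [1.0, 1.0]])  # mass along the diagonal
h = np.array([0.1, 0.1])                   # per-dimension bandwidths
x = np.array([0.0, 1.0])                   # an "off-diagonal" point

# (1/n) sum_i prod_j K_hj(x_j - X_ij): diagonal-bandwidth multivariate KDE
diag_kde = np.mean(np.prod(gauss(x - data, h), axis=1))

# prod_j [(1/n) sum_i K_hj(x_j - X_ij)]: product of marginal univariate KDEs
prod_kde = np.prod(np.mean(gauss(x - data, h), axis=0))

print(diag_kde, prod_kde)  # diag_kde ~ 0; prod_kde keeps mass at (0, 1)
```

If, instead, ProductKDE applies the product kernel inside the sum over data points, the two densities coincide and only the bandwidth selection can differ.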
Hello David, after analyzing the definition of PyBNesian's ProductKDE and the definition of the diagonal-bandwidth-matrix KDE found in the literature, I think they represent the same concept with different notation.
For my implementation I wanted to use ProductKDE for Hill Climbing, but for lack of skill and time I wasn't able to do so. Therefore, my hack fix was to manually remove the off-diagonal elements from the covariance matrix in NormalReferenceRule.hpp and ScottsBandwidth.hpp.
This is just a temporary fix. I think the best solution would be to allow selecting your ProductKDE implementation for the Hill Climbing algorithm, but my lack of knowledge prevented me from doing so.
Hello David, for the moment I have rebased my master branch onto tag v0.5.2 (before my diagonal bandwidth fix). This way you can merge my latest changes without altering your functionality.
I will continue using the diagonal bandwidth in a separate personal branch (feature/diagonal-bandwidth) for the moment.