
Diagonal covmat

Open jacoterh opened this issue 10 months ago • 20 comments

This PR performs the training/validation split in the diagonal basis of the t0 covariance matrix.
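
As a rough illustration (not the actual n3fit code; all names here are made up), the idea can be sketched in a few lines of NumPy: diagonalise the t0 covmat, rotate data and theory into the eigenbasis, and apply the training mask to eigenmodes rather than to raw datapoints.

import numpy as np

def diagonal_trvl_split(data, theory, c_t0, frac=0.75, rng=None):
    # Hypothetical sketch: split eigenmodes of the t0 covmat,
    # not the raw datapoints, into training and validation.
    rng = np.random.default_rng() if rng is None else rng
    eig_vals, eig_vecs = np.linalg.eigh(c_t0)  # ascending eigenvalues
    # rotate data and theory predictions into the diagonal basis
    data_diag = eig_vecs.T @ data
    theory_diag = eig_vecs.T @ theory
    # one global random mask over eigenmodes, not one per dataset
    n = len(eig_vals)
    tr_mask = np.zeros(n, dtype=bool)
    tr_mask[rng.choice(n, size=int(frac * n), replace=False)] = True
    # in the diagonal basis the chi2 is a sum of independent terms,
    # so masking eigenmodes splits the loss cleanly
    res2 = (data_diag - theory_diag) ** 2 / eig_vals
    return res2[tr_mask].sum(), res2[~tr_mask].sum()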

Without any split, i.e. pure training, the loss must be the same whether or not one diagonalises. Indeed, at float64 precision, with this test runcard I find:

float64, non diagonal

Epoch 100/17000: loss: 5.3918877
Epoch 200/17000: loss: 5.3818284
Epoch 300/17000: loss: 5.3818996
Epoch 400/17000: loss: 5.0928206
Epoch 500/17000: loss: 5.2855544
Epoch 600/17000: loss: 5.0460824
Epoch 700/17000: loss: 5.1262326
Epoch 800/17000: loss: 5.0686013
Epoch 900/17000: loss: 4.9429147
Epoch 1000/17000: loss: 5.1852478
Epoch 2000/17000: loss: 4.7091641
Epoch 4000/17000: loss: 4.3651678


float64, diagonal

Epoch 100/17000: loss: 5.3918877
Epoch 200/17000: loss: 5.3818284
Epoch 300/17000: loss: 5.3818996
Epoch 400/17000: loss: 5.0928206
Epoch 500/17000: loss: 5.2850301
Epoch 600/17000: loss: 5.4309732
Epoch 700/17000: loss: 5.1401602
Epoch 800/17000: loss: 5.1322600
Epoch 900/17000: loss: 5.1281818
Epoch 1000/17000: loss: 5.1255253
Epoch 1500/17000: loss: 4.9968909
Epoch 2000/17000: loss: 4.5648582
Epoch 2500/17000: loss: 4.4773234
Epoch 3000/17000: loss: 4.4334937
Epoch 4000/17000: loss: 4.3631501
Epoch 5000/17000: loss: 4.1540497

Up to the precision printed here, the losses remain in sync until around epoch 500, after which they start showing small differences. This is expected: a tiny precision-level deviation in the loss early on also leads to a different update of the NNs, and hence a different path through the loss landscape. What matters is that, after training is done, the difference at the PDF level is of the same order as changing the seed of the NNs.

Indeed, in a DIS-only fit at float64 precision, I find https://vp.nnpdf.science/UobKaEuXQf-BwMeMN_Z2Mw==/ where the distance plots show that the difference between non-diagonal and diagonal is well within one sigma.

The diagonal basis can be activated in the runcard by adding diagonal_basis: True at the top level of the YAML. With this, the implementation is ready, but we should still perform the following checks with a non-zero training/validation split:

  • [x] Using closure pseudodata where we remove any possible hidden correlations: https://vp.nnpdf.science/xBMmCMHjQ5yB5_l0uVK_Aw==/
  • [x] Non-diagonal basis reproduces master: https://vp.nnpdf.science/obaVuxjRSv-4fyLpg80jhw==/
  • [x] Using real data: https://vp.nnpdf.science/fTfKodChTtKShpHWpiRdCg==/
  • [ ] Using inconsistent closure pseudodata

jacoterh avatar Mar 17 '25 16:03 jacoterh

Greetings from your nice fit :robot: ! I have good news for you, I just finished my tasks:

  • Fit Name: NNBOT-a619ec218-2025-04-02
  • Fit Report wrt master: https://vp.nnpdf.science/jx3CthBfSb27Qz1FKfnJtw==
  • Fit Report wrt latest stable reference: https://vp.nnpdf.science/_GaQ77Z0RVu8EwK-p1-2ig==
  • Fit Data: https://data.nnpdf.science/fits/NNBOT-a619ec218-2025-04-02.tar.gz

Check the report carefully, and please buy me a :coffee: , or better, a GPU :wink:!

github-actions[bot] avatar Apr 02 '25 15:04 github-actions[bot]

I wonder where the difference is coming from?

How are the cuts (the training/validation cuts) applied differently? What would happen if you did the closure fit without a tr/vl split, as in your test?

Perhaps we should do this in two steps:

  1. First apply the tr/vl cuts and validate that the results are the same.
  2. Then change the cuts from dataset-by-dataset to full-dataset.

Alternatively, do a 1000-replica fit where hopefully the difference due to how we cut away the data is washed out... I guess doing the cut on a dataset-per-dataset basis has some merit, as it ensures that for every replica, for every process and x-Q coverage, there is a contribution to both training and validation.
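
A toy sketch of the two cutting strategies under discussion, assuming plain NumPy and made-up dataset sizes: per-dataset masks guarantee each dataset keeps ~75% of its points in training, while a single global mask only fixes the overall fraction.

import numpy as np

rng = np.random.default_rng(0)
sizes = [40, 60, 100]  # hypothetical datapoints per dataset
frac = 0.75

# (a) dataset-by-dataset: each dataset is guaranteed ~75% training
masks_per_ds = []
for n in sizes:
    m = np.zeros(n, dtype=bool)
    m[rng.choice(n, size=int(frac * n), replace=False)] = True
    masks_per_ds.append(m)
concatenated = np.concatenate(masks_per_ds)

# (b) one global mask over the concatenated data: 75% overall,
# but an individual dataset may end up over- or under-represented
n_tot = sum(sizes)
global_mask = np.zeros(n_tot, dtype=bool)
global_mask[rng.choice(n_tot, size=int(frac * n_tot), replace=False)] = True

print(concatenated.mean(), global_mask.mean())  # both ~0.75 overall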

scarlehoff avatar Apr 08 '25 07:04 scarlehoff

Hi @scarlehoff, not sure I understand what you are suggesting; could you be more specific? Even though the training/validation ratio is specified per dataset, i.e. frac: 0.75, the actual split is done in the diagonal basis, which means that the 75% ratio no longer applies to any specific dataset but to all datasets collectively. The rotation mixes datapoints across different datasets, after all.

jacoterh avatar Apr 08 '25 14:04 jacoterh

could you be more specific?

At the beginning you did a test with trvl: 1.0 and found small numerical differences. My suggestions to check whether these grow enough to be noticeable are:

  • A fit with trvl: 1.0 (diagonal or not, the result should be equivalent)
  • A fit in which trvl: 0.75 but you first apply the cuts, then diagonalize training and validation (perhaps with genrep = false to make it easier).

These two tests will confirm whether the differences we see in your report are coming from the cuts (entire dataset vs dataset-by-dataset), and then we'll need to think about the implications of that.*

*e.g., in the diagonal case it doesn't make sense to have frac-per-dataset or to concatenate the masks

scarlehoff avatar Apr 08 '25 16:04 scarlehoff

  • A fit with trvl: 1.0 (diagonal or not, the result should be equivalent)

Results are equivalent, as you can see in this DIS-only fit without any split: https://vp.nnpdf.science/UobKaEuXQf-BwMeMN_Z2Mw==/

jacoterh avatar Apr 09 '25 10:04 jacoterh

  • A fit in which trvl: 0.75 but you first apply the cuts, then diagonalize training and validation (perhaps with genrep = false to make it easier).

That agrees with what I've been doing so far: I apply the training/validation split at the individual dataset level before rotating to the diagonal basis. I am not sure it matters when we apply the mask. We can either concatenate all the individual masks or compute one big mask at the very end after combining all datasets. As long as datapoints get randomly assigned and the split is 75/25 in both, these should be equivalent, no?

jacoterh avatar Apr 09 '25 10:04 jacoterh

I thought you were applying the mask after diagonalizing, since you apply it to eig_vals (so, in a fit without MHOU, it is applied at the level of the experiment; with MHOU, to the entire dataset). If so, the data that is cut away is not the same, and it will actually be quite a different set (since the ordering gets completely mixed up).

I am not sure it matters when we apply the mask. We can either concatenate all the individual masks or compute one big mask at the very end after combining all datasets. As long as datapoints get randomly assigned and the split is 75/25 in both, these should be equivalent, no?

The results might be equivalent in the limit of many replicas. Also, by concatenating the individual masks you are not covering some situations (e.g. you will never get the top 25% of eigenvalues all masked at once), so it is not completely random.

Btw, I don't know how deterministic eigh is either, which means we should probably order the eigenvalues in order to be able to reproduce fits.
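
For reference, np.linalg.eigh does return eigenvalues in ascending order, but the overall sign of each eigenvector is arbitrary (and degenerate eigenvalues leave even more freedom). A minimal sketch of one possible convention to make the decomposition reproducible, assuming non-degenerate eigenvalues; the function name is made up:

import numpy as np

def deterministic_eigh(covmat):
    # eigh already sorts eigenvalues in ascending order, but the sign
    # of each eigenvector is arbitrary, so we fix a convention:
    # make the largest-magnitude entry of each eigenvector positive
    eig_vals, eig_vecs = np.linalg.eigh(covmat)
    idx = np.argmax(np.abs(eig_vecs), axis=0)
    signs = np.sign(eig_vecs[idx, np.arange(eig_vecs.shape[1])])
    return eig_vals, eig_vecs * signs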

edit: just to say, my proposal is simply to test that no numerical problems arise in a full fit when doing the diagonal covmat in a situation in which the cuts stay the same.

scarlehoff avatar Apr 09 '25 10:04 scarlehoff

Very good, I see. Just to clarify: I am applying the mask after the rotation; I just meant that I construct the mask before the rotation. And indeed, I am applying it to the eigenvalues, as you noticed :)

jacoterh avatar Apr 09 '25 11:04 jacoterh

Hi @scarlehoff, I have performed the check you proposed, comparing:

  1. First split, then diagonalise training and validation separately.
  2. Same split, but don't diagonalise.

I put the relevant code I needed to perform this check on a separate branch, as it would interfere with what we actually want (diagonalising before splitting): https://github.com/NNPDF/nnpdf/tree/diagonal_covmat_exc_juan

The loss values agree between 1 and 2 (up to precision effects): log_splitting_diag_trvl_sep.txt
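
This agreement is expected: the rotation to the diagonal basis is orthogonal, so the chi2 of each block is invariant, chi2 = r^T C^{-1} r = sum_i (V^T r)_i^2 / lambda_i. A toy NumPy check of that identity (random matrices, illustrative only):

import numpy as np

rng = np.random.default_rng(42)
n = 50
a = rng.normal(size=(n, n))
covmat = a @ a.T + n * np.eye(n)  # symmetric positive definite
residual = rng.normal(size=n)     # stands in for data - theory

# chi2 in the original basis
chi2 = residual @ np.linalg.inv(covmat) @ residual

# chi2 in the diagonal basis: rotate the residual, divide by eigenvalues
eig_vals, eig_vecs = np.linalg.eigh(covmat)
res_diag = eig_vecs.T @ residual
chi2_diag = np.sum(res_diag**2 / eig_vals)

assert np.isclose(chi2, chi2_diag)  # equal up to precision effects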

jacoterh avatar Apr 10 '25 16:04 jacoterh

Nice, thanks!

Then the next step, I'd say, is to do a large fit (many replicas) to see how the choice of diagonal vs non-diagonal might affect the results.

Btw, perhaps instead of doing the splitting in the dictionary creation it makes sense to do it in the function tr_masks above. If the diagonal basis is on, instead of going through the loop over datasets you exit early with what you are doing now, but outputting the same kind of object. This might help with some of the errors.

Some others are related to the fact that we try to write the training and validation data down during the fit, and I'm not sure whether we want to do that in the diagonal basis...

scarlehoff avatar Apr 11 '25 06:04 scarlehoff

Just for the record, and to help us keep track, these are the reports with the negative modes removed in the diagonal basis:

  • real data: https://vp.nnpdf.science/V-SA51LcS_Kx7wEx8p6vFg==
  • fake data: https://vp.nnpdf.science/rceHcI8MTdaWpZlKkb05vw==

The split and change of basis is done at the level of the full covariance matrix, not per dataset separately.
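
A minimal sketch of what removing negative modes might look like, assuming the same eigenbasis construction as above (illustrative only, not the actual implementation; the function name is made up):

import numpy as np

def drop_negative_modes(covmat, data, theory):
    # diagonalise the full covariance matrix, then keep only the
    # eigenmodes with (numerically) positive eigenvalues
    eig_vals, eig_vecs = np.linalg.eigh(covmat)
    keep = eig_vals > 0  # or a tolerance, e.g. eig_vals > tol * eig_vals.max()
    data_diag = (eig_vecs.T @ data)[keep]
    theory_diag = (eig_vecs.T @ theory)[keep]
    return eig_vals[keep], data_diag, theory_diag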

jacoterh avatar Apr 17 '25 10:04 jacoterh

The split and change of basis is done at the level of the full covariance matrix, not per dataset separately.

This I don't understand. I had a look yesterday after the meeting, and indeed the function in which the diagonalization occurs is called experiment by experiment (not dataset by dataset; perhaps I said dataset when I meant experiment, apologies).

In any case, all the data in the runcard is only taken together when MHOUs are enabled. So it would be interesting to see whether the issue we discussed actually disappears in that case.

scarlehoff avatar Apr 17 '25 10:04 scarlehoff

Indeed, I was also just checking this and you're right, it's called by experiment! This wasn't obvious to me before, as I had tested it only on datasets within the same experiment, where I noticed all the datasets got combined.

Edit: we should split at the level of the full covmat though, not per experiment, because the covmat will no longer be block-diagonal when MHOUs are included, as you rightly point out.

jacoterh avatar Apr 17 '25 10:04 jacoterh

@scarlehoff In the case of MHOU, do you have an idea where in validphys the various experiments get collected such that they are treated as one? I can see that it works this way, but I cannot find which part of the code is responsible for it.

jacoterh avatar Apr 17 '25 17:04 jacoterh

It seems the grouping in the MHOU case is handled by this statement: https://github.com/NNPDF/nnpdf/blob/3d10bbc0a6ba93d33d7951a573fa6bbb8d91456a/validphys2/src/validphys/config.py#L1654

jacoterh avatar Apr 18 '25 15:04 jacoterh

Hi @scarlehoff, I am working on the tests, and I'm seeing that test_pseudodata.py results in

FAILED validphys2/src/validphys/tests/test_pseudodata.py::test_recreate_fit_pseudodata - AttributeError: 'NoneType' object has no attribute 'is_polarized'
FAILED validphys2/src/validphys/tests/test_pseudodata.py::test_recreate_pdf_pseudodata - AttributeError: 'NoneType' object has no attribute 'is_polarized'
FAILED validphys2/src/validphys/tests/test_pseudodata.py::test_read_matches_recreate - AttributeError: 'NoneType' object has no attribute 'is_polarized'

This is triggered by coredata.py in determine_pdfs, where pdf is None, so pdf.is_polarized raises the AttributeError above. Do you have any idea how this could be related to the implementation of the diagonal basis? I don't see how. Thanks!

jacoterh avatar May 13 '25 12:05 jacoterh

determine_pdfs is called by the convolution functions to check which kind of PDF is to be used, so at some point a convolution with None is happening (a t0 covmat perhaps being computed?).

If I have to guess, this is due to some collateral damage from changing the grouping since all these functions depend on that.

scarlehoff avatar May 13 '25 12:05 scarlehoff

Indeed, t0set is None in https://github.com/NNPDF/nnpdf/blob/81c0da789dc5f3dfdcb8910bf4821765f1a23965/validphys2/src/validphys/covmats.py#L250 while it gives a PDF instance on the master branch.

jacoterh avatar May 14 '25 13:05 jacoterh

Note that, since (two days ago?), the master branch is using the t0-sampling-by-default that @comane implemented, so it is not necessarily the same as what you have.

That said, I don't understand why changing the grouping (if that's the reason) would send you through the t0 predictions functions...

scarlehoff avatar May 14 '25 13:05 scarlehoff

When I deliberately remove dataset_inputs_t0_predictions, validphys tells me it's needed to process:

E  reportengine.configparser.InputNotFoundError: A parameter is required: dataset_inputs_t0_predictions.
E           This is needed to process:
E            - recreate_fit_pseudodata
E           through:
E            - fit_masks
E           through:
E            - replica_mask_table
E           through:
E            - replica_mask
E           through:
E            - exps_masks
E           through:
E            - masks
E           through:
E            - dataset_inputs_fitting_covmat

jacoterh avatar May 14 '25 13:05 jacoterh

Hi @scarlehoff, regarding the last failing test: is it enough to rerun the fit bot to update the JSONs in extra_tests/regression_fits to fix this? At the moment, I am getting

FAILED extra_tests/regression_checks.py::test_regression_fit[False-normal_fit-72] - AssertionError: 
FAILED extra_tests/regression_checks.py::test_regression_fit[False-diagonal-45] - AssertionError: 
FAILED extra_tests/regression_checks.py::test_regression_fit[False-no_lagrange-27] - AssertionError: 
FAILED extra_tests/regression_checks.py::test_regression_fit[False-polarized_evol-34] - AssertionError: 
FAILED extra_tests/regression_checks.py::test_regression_fit[False-t0theoryid-100] - AssertionError: 

jacoterh avatar May 19 '25 11:05 jacoterh

Yes. I think only the diagonal one is failing in the CI; all the others fail because of differences between your computer and the CI.

scarlehoff avatar May 19 '25 11:05 scarlehoff

Oh, the t0 sampling. I think it would be better if you skip the last commit by the bot, rebase 475bbe7 on top of master, and then run the bot.

scarlehoff avatar May 19 '25 11:05 scarlehoff

Greetings from your nice fit :robot: ! I have good news for you, I just finished my tasks:

  • Fit Name: NNBOT-eb97adb45-2025-05-19
  • Fit Report wrt master: https://vp.nnpdf.science/eB8a3DN_R6uKiFsVE8E85Q==
  • Fit Report wrt latest stable reference: https://vp.nnpdf.science/iWSUOv7KQcGblpVhz5VDXg==
  • Fit Data: https://data.nnpdf.science/fits/NNBOT-eb97adb45-2025-05-19.tar.gz

Check the report carefully, and please buy me a :coffee: , or better, a GPU :wink:!

github-actions[bot] avatar May 19 '25 12:05 github-actions[bot]

Greetings from your nice fit :robot: ! I have good news for you, I just finished my tasks:

  • Fit Name: NNBOT-01c9eb86a-2025-05-19
  • Fit Report wrt master: https://vp.nnpdf.science/_HFT1HZNSASlSjDKshel9g==
  • Fit Report wrt latest stable reference: https://vp.nnpdf.science/jzI_LZ5lSI2BTZ2fErkxmA==
  • Fit Data: https://data.nnpdf.science/fits/NNBOT-01c9eb86a-2025-05-19.tar.gz

Check the report carefully, and please buy me a :coffee: , or better, a GPU :wink:!

github-actions[bot] avatar May 19 '25 14:05 github-actions[bot]

Thanks @scarlehoff, what's going on with the tests? I had fixed them all a couple of days ago (up to the regression_tests), but now it seems new ones have appeared...

jacoterh avatar May 21 '25 08:05 jacoterh

Not sure, but in the last few changelogs numpy seems to be working on the thread-safety of some functions, which can change the order of operations. And diagonalizing a huge, not-super-well-behaved matrix is very sensitive to that (for the data, the change happened for CMS 1 Jet).

Edit: interestingly, these last ones were not due to numpy after all. If it happens again we'll have to look closely at the diagonalization...

scarlehoff avatar May 21 '25 08:05 scarlehoff

For reference, links to reports

  • Comparing two fits with and without the diagonal basis in the presence of MHOUs and t0 sampling: https://vp.nnpdf.science/1a6KBrruQdyDUTw_lCyaRQ==/
  • Comparing the non-diagonal basis implementation on master and the current branch: https://vp.nnpdf.science/3CKAWh5LQP2eI0dJVnJAqQ==/. The difference is probably due to the new way we take the inverse of the covariance, also in the non-diagonal basis: we take the inverse of the correlation matrix and convert this to the inverse covariance, as opposed to directly inverting the covmat (see the sketch after this list). Around April we still got perfect agreement, see: https://vp.nnpdf.science/obaVuxjRSv-4fyLpg80jhw==/.
  • Directly taking the inverse of the covmat results in better agreement between this branch and master: https://vp.nnpdf.science/r7a-HbwjTniL-qzFYUUspg==/. @scarlehoff Are we happy with how this comparison looks? I was expecting 100% agreement, and I'm not sure at this point why we're not getting that.
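
For the record, the two inversion routes mentioned in the second bullet are algebraically identical, C^{-1} = D^{-1} R^{-1} D^{-1} with D = diag(sigma), but they differ numerically. A toy comparison (random matrix, illustrative only):

import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=(30, 30))
covmat = a @ a.T + 30 * np.eye(30)

# route 1: direct inverse of the covmat
inv_direct = np.linalg.inv(covmat)

# route 2: via the correlation matrix, C = D R D with D = diag(sigma),
# so C^{-1} = D^{-1} R^{-1} D^{-1}
sigma = np.sqrt(np.diag(covmat))
corr = covmat / np.outer(sigma, sigma)
inv_via_corr = np.linalg.inv(corr) / np.outer(sigma, sigma)

# the two agree only up to numerical precision, which can show up
# as small differences in a fit
print(np.max(np.abs(inv_direct - inv_via_corr)))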

jacoterh avatar May 28 '25 21:05 jacoterh

Hi @jacoterh

Given what was just decided, could you change the default to true before merging? :P

Please also change the nnpdf40-like template to add diagonal_basis: True explicitly, with a comment, so that people using that runcard have the information there.

(All regressions are going to fail, of course, but instead of having one regression that is "diagonal_basis", could you change it to "non_diagonal_basis"? https://github.com/NNPDF/nnpdf/tree/master/extra_tests/regression_fits)

NB: regenerating old pseudodata is going to fail because the trvl mask is going to change. Perhaps we can fix that here: https://github.com/NNPDF/nnpdf/blob/516e83785bf269e83ad7624bc643cf9c072cb754/validphys2/src/validphys/config.py#L276 so that if a fit was done with a version < 4.1 it is assumed that diagonal_basis = False, and True otherwise. I was planning to do the first 4.1.0 tag after this PR and the one with the tests were merged anyway, so a version check for 4.1 here would work great.
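
A hypothetical sketch of that version check (the function name and the use of the packaging library are my own, not actual validphys code):

from packaging.version import Version

def default_diagonal_basis(fit_version: str) -> bool:
    # assume the legacy (non-diagonal) behaviour for fits produced
    # before 4.1, and the diagonal basis by default afterwards
    return Version(fit_version) >= Version("4.1.0")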

Directly taking the inverse of the covmat results in better agreement between this branch and master: https://vp.nnpdf.science/r7a-HbwjTniL-qzFYUUspg==/. @scarlehoff Are we happy with how this comparison looks? I was expecting 100% agreement, and I'm not sure at this point why we're not getting that.

The difference here looks like it is at the numerical-precision level; I think it is ok.

scarlehoff avatar May 30 '25 11:05 scarlehoff

Could we merge this, please, and then address the other points in another PR? We are relying on this for the hyperopt PR, and it is getting a bit of a pain to keep merging it locally while it is still changing.

Radonirinaunimi avatar Jun 16 '25 08:06 Radonirinaunimi