CHT (and/or heat zone) restart (primal+adjoint) issues
Describe the bug
Hi all,
I noticed some issues with restarts, both for primal-only runs and for the primal iteration in the discrete adjoint. I do the following:
1. Run a simulation with X+1 iterations. The residuals of iteration X+1 are what the restarted runs should reproduce. This is the ground truth.
2. Run a simulation with X iterations. This gives us a file to restart from. (A good sanity check is to diff its history file against that of simulation 1. to see whether the simulations are deterministic at all.)
3. Run a primal simulation restarted from the restart file of 2. with just 1 iteration. The residuals should match the results of 1.
4. Run an adjoint simulation with the restart file of 2. The residuals should match the results of 1.
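The sanity check mentioned for the X-iteration run (comparing its history file against that of the X+1 run) could be sketched as follows. This is only a minimal sketch, assuming the history files are plain comma-separated text with a single header line and one row per iteration; the function name `first_mismatch` and the sample data are hypothetical and not part of SU2.

```python
import csv
import io


def first_mismatch(history_long, history_short):
    """Return the index of the first data row that differs, or None.

    Both arguments are the text of a comma-separated history file with a
    single header line (an assumption about the file layout). Only the rows
    present in both files are compared.
    """
    rows_long = list(csv.reader(io.StringIO(history_long)))[1:]
    rows_short = list(csv.reader(io.StringIO(history_short)))[1:]
    for i, (a, b) in enumerate(zip(rows_long, rows_short)):
        # Compare the fields as written, so any differing digit counts.
        if [f.strip() for f in a] != [f.strip() for f in b]:
            return i
    return None


# Made-up example: the long run has X+1 = 3 rows, the short run has X = 2 rows.
run1 = "iter,res\n0,-1.0\n1,-2.0\n2,-3.0\n"
run2 = "iter,res\n0,-1.0\n1,-2.0\n"
print(first_mismatch(run1, run2))  # None means the first X rows agree
```

If the function returns an index instead of `None`, the two runs already diverge before the restart point, and the restart comparison itself is not meaningful.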
Below, each case shows 3 lines with one or more residual values each. The first line is the (X+1)-th history entry of simulation 1. (the ground truth). The second line is the last history line of the restarted primal from simulation 3. The third line is from the adjoint primal restart of simulation 4., grabbed from the screen output with OUTPUT_PRECISION=12 (see #1394).
1.res[1], 1.res[2], ...
3.res[1], ...
4.res[1], ...
Of course the best outcome would be 3 identical lines ... which we don't get :(
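A comparison of such a three-line block can be sketched as below. This is a minimal sketch and not the script actually used for this report; the function name `compare_to_truth` is hypothetical. The sample rows are copied from the 10-iteration CHT Pin Array case further down.

```python
def compare_to_truth(lines):
    """Compare restarted residual rows against the first (ground-truth) row.

    `lines` holds whitespace-separated residual rows. Returns, for every
    later row, the per-column absolute differences to the first row.
    """
    rows = [[float(v) for v in line.split()] for line in lines]
    truth = rows[0]
    return [[abs(v - t) for v, t in zip(row, truth)] for row in rows[1:]]


block = [
    "-4.5580336629 -4.71337114354 -4.64920624665 1.52390474896 -5.8150835186",
    "-4.5580336629 -4.71337114354 -4.64920624665 1.52390474896 -5.8150835186",
    "-4.5580336629 -4.71337114354 -4.64920624665 1.52390474896 -6.27627665971",
]
diffs = compare_to_truth(block)
for name, d in zip(("primal restart", "adjoint restart"), diffs):
    print(name, ["%.1e" % x for x in d])
```

A row of zeros means the restarted run reproduces the ground truth to the printed precision; any nonzero entry points at the column (here the last one, T_solid, for the adjoint restart) that needs attention.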
Pin Array setup 2D - Fluid Only
(p, vx, vy, T, k, w) 200 iterations
-7.16607941386 -7.34805457325 -6.99877222345 -1.01313133295 -8.55717653108 -1.6476144338
-7.16607941386 -7.34805457325 -6.99877222345 -1.01313133295 -8.55717653108 -1.6476144338
-7.16607941386 -7.34805457325 -6.99877222345 -1.01313133295 -8.55717653108 -1.6476144338
Everything's fine 👍
Pin Array setup 2D - Solid Only
Note that this will only work with the fix in #1394
10 iterations (8 Linear Solver Iter)
-6.83193258622
-6.83193258622
-6.83193258622
10 iterations (10 Linear Solver Iter)
-7.38737630018
-7.38737630016
-7.38737630016
10 iterations (20 Linear Solver Iter)
-8.92762658265
-8.92762658317
-8.92762658317
10 iterations (200 Linear Solver Iter)
-8.92702259526
-8.92702259594
-8.92702259594
Here I suspect floating-point effects, with a minor error that accumulates up to a certain point. To be fair, this doesn't worry me too much.
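To put such differences in perspective, here is a small sketch that converts a difference in the printed values into a relative difference of the residual itself. It assumes the printed values are log10 of the RMS residual (the usual SU2 output convention, but an assumption here); the function name `relative_difference` is hypothetical.

```python
def relative_difference(log_res_ref, log_res_other):
    """Relative difference of two residuals given as log10 values."""
    return 10.0 ** (log_res_other - log_res_ref) - 1.0


# Solid-only case, 10 iterations with 20 linear solver iterations.
small = relative_difference(-8.92762658265, -8.92762658317)
# Solid-only case, 200 iterations with 10 linear solver iterations.
large = relative_difference(-16.5822916687, -16.2000952843)
print("%.2e" % small, "%.2e" % large)
```

Under that assumption, the first mismatch is a relative difference on the order of 1e-9, while the 200-iteration solid-only case differs by more than a factor of two.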
200 iterations (10 Linear Solver Iter)
-16.5822916687
-16.2000952843
-16.2000952843
But with more iterations, more problems arise, so back to the drawing board for that. Maybe the root cause of the CHT problems is hidden here as well.
CHT Pin Array setup 2D
Here things get really weird.
- With a low iteration count, the primal-only restart appears to work perfectly and only the solid temperature residual of the adjoint restart is off.
- With higher iteration counts, the solid temperature residual still differs, but now the mean flow residuals of both restarted runs no longer agree with the X+1 iteration simulation ... what?
(p, vx, vy, T_fluid, T_solid) 10 Iterations
-4.5580336629 -4.71337114354 -4.64920624665 1.52390474896 -5.8150835186
-4.5580336629 -4.71337114354 -4.64920624665 1.52390474896 -5.8150835186
-4.5580336629 -4.71337114354 -4.64920624665 1.52390474896 -6.27627665971
200 Iterations
-12.6894989871 -13.0272466772 -12.776380701 -1.01446550457 -7.17890161426
-12.6894989199 -13.0272465259 -12.7763807181 -1.01446550457 -7.17890161426
-12.6894989199 -13.0272465259 -12.7763807181 -1.01446550457 -7.30259065606
200 Iterations (No CHT interface at all, i.e. still "multizone" but no coupling between the zones)
-12.6993664689 -13.037441642 -12.7880987801 -0.895636121058 -16.5806369934
-12.6993665267 -13.0374417614 -12.7880988088 -0.895636121058 -16.1994417242
-12.6993665267 -13.0374417614 -12.7880988088 -0.895636121058 -16.1994417242
2000 Iterations
-17.5073098614 -17.7104073858 -17.9003808832 -3.34538088409 -9.30160418764
-17.4072816449 -17.5306206426 -17.7140334705 -3.34538088409 -9.30160418771
-17.4072816449 -17.5306206426 -17.7140334705 -3.34538088409 -9.425709713
Also note that the residual for the adjoint restart is better than expected, and not just by a tiny amount. This naturally leads to the hypothesis that the direct solution is not reset after the CLEAR_INDICES run. But it is reset: I checked, and I also print the DirectResidual for all DirectIterations (2 flow + 2 mesh ones), and they are always the same. If the residual dropped dramatically for the adjoint restart, that would probably be easier to debug.
Of course I also checked whether the correct Solution values are read, which I am somewhat sure they are, although I can only do spot checks.
4000 iterations
-17.5190807322 -17.7163086125 -17.8778784145 -5.70791061685 -11.6640663533
-17.418063519 -17.5356055663 -17.7081078178 -5.70791062246 -11.6640664169
-17.418063519 -17.5356055663 -17.7081078178 -5.70791062246 -11.7881662873
CHT Pin in Crossflow 2D
For another CHT test case the findings are similar, with one notable difference: the solid_T residuals of the primal restart and the adjoint restart now match much better (although they still differ), but both differ quite significantly from the X+1 ground truth.
10 Iter
-4.926899175 -7.918963781 -8.148204896 1.135311148 -4.163756124
-4.926899175 -7.918963781 -8.148204896 1.135311148 -4.163756124
-4.926899175 -7.918963781 -8.148204896 1.135311148 -4.472390292
200 Iter
-16.0186192 -18.98207162 -19.03351791 -2.806755076 -5.585674129
-16.00510692 -18.97843574 -19.03334954 -2.806755076 -5.585674129
-16.00510692 -18.97843574 -19.03334954 -2.806755076 -5.602103211
2000 Iter
-16.50635481 -19.9763931 -20.42969871 -10.29288196 -14.21416876
-16.44211089 -19.76100653 -20.35694756 -10.23037364 -13.79321985
-16.44211089 -19.76100653 -20.35694756 -10.23037364 -13.79321861
Note that for this specific 2000 Iter case the adjoint(-primal) residual for solid_T is worse than the ground truth X+1, whereas it is better in all other CHT cases shown here (I feel the "no-coupling" case shouldn't be counted for this).
To Reproduce
I will post my setups here later; I cannot upload through the VPN. I also use a simple bash script to do these comparisons for me, so the chance of manual errors is much lower.
Additional Notes
A few things up front: I run FGMRES+ILU for all configurations. No periodic boundaries at all. I went without turbulence for the CHT cases to keep things simpler.
In the past, and now as well, we have seen good gradient validation against FD. So this issue is not super dramatic (although I am pretty annoyed by it), and I think I simply overlooked it in the past.
In case something is unclear, please let me know; I will try to clarify asap.
I still have some debugging to do, but I appreciate all hints, as I am currently poking around in the fog.
Thanks already, Tobi
Desktop (please complete the following information):
- OS: [RHELS 7.6 Maipo]
- C++ compiler and version: [g++ (GCC) 5.3.0]
- MPI implementation and version: [OpenMPI 3.1.6]
- SU2 Version: [#1394]
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is still a relevant issue please comment on it to restart the discussion. Thank you for your contributions.
Still relevant
This is still the case, but for all practical applications like restarts or adjoint computations it has no notable influence. I'll leave this open and might tackle it at a later stage :+1:
Dear stale-bot,
this is still relevant. Might make some debugging efforts at some point.
Thanks for the reminder, Tobi