SU2 icon indicating copy to clipboard operation
SU2 copied to clipboard

Aachen Turbine Test Case

Open jblueh opened this issue 1 year ago • 1 comments

The Aachen turbine test case added in #2158 needs another look, for two reasons.

  1. The test case is subject to a yet unfixed memory access violation, as detected in the address sanitizer tests in #2246. We explored a fix in c97610b02bc1281eb766847e7ef5ed7ac451dfb9, that remedies the address sanitzer detections but also changes the test results notably. It is unclear yet whether this is the correct way to fix it.
  2. @bigfooted observed non-deterministic behaviour in #2260. This might or might not be related to the memory access violation.

As soon as the memory access violation is fixed, address sanitizer testing can be re-enabled for the Aachen turbine test case (it was disabled in 0f3fc3ec900f1518c0c3ed26c3e971ae27fb974c).

@alecappiello We can continue the discussion from #2246 here.

jblueh avatar May 08 '24 11:05 jblueh

As mentioned previously I would be surprised if the fix implemented in c97610b is the source of the issue as I would expect the other turbomachinery cases to fail as well. Looking at the testcase there would be two things I would try.

  1. Converge the solution very tightly, however I expect that this may be difficult, save the converged solution. Restart the solver and run for 1 iteration and compare the results of the two. Inspect the solution and see if there is anything obvious, this may give a hint as to where the problem lay. Maybe also include linear solver information in the output to see if the desired convergence is being achieved in the initial iterations.

  2. Furthermore inspect the residuals of the initial converged solution as it approaches the restart point, a jump in residuals could indicate that the solver is not converging smoothly before the termination of the initial solution. I suggest maybe increasing the value of ENTROPY_FIX_COEFF, say 0.1 and see what happens.

If you can't converge the solution tightly in point 1, apply the increased entropy fix in point 2 and retry. If you would like @alecappiello once you have tried this we can organise a call to discuss early next week.

joshkellyjak avatar May 08 '24 14:05 joshkellyjak

@joshkellyjak The memory access violation described here does not occur unconditionally, it depends on the test case. The iterations of the inner loop for jSpan == 0 and jSpan == nSpanDonor - 1 (that we removed in the fix) only have an effect if the if-conditions in the loop body evaluate to true. I checked this for the serial test cases in the test file's "turbomachinery" section.

  • aachen_turbine_restart: iterations both with jSpan == 0 and jSpan == nSpanDonor - 1 have an effect, the latter trigger the memory access violations
  • jones_turbocharger_restart: some iterations with jSpan == 0 have an effect, but iterations for larger jSpan usually overwrite the results; also, the case jSpan == 0 does not trigger the memory access violation
  • axial_stage2D: the problematic code is not executed by this test case
  • transonic_stator_restart: the problematic code is not executed by this test case

I think these observations point to why only the Aachen test case had address sanitizer findings, and why only the results of the Aachen test case were affected by the attempted fix. The other test cases basically "do not care" whether we do the extra iterations.

@alecappiello Regarding your earlier question about the restart file. Restart files are to some extent specific to the version of SU2 they were obtained with, in the sense that newer versions of SU2 can behave differently if they use an old restart file. I was wondering if this is the case here, too. If the memory access violation had manifested itself in the restart file, a version of SU2 with the memory access violation fixed could give different results. The observations above point to why only the Aachen test case is affected.

jblueh avatar May 23 '24 17:05 jblueh

I'll try and find some time to have a look at this today. I think the issue is in the jSpan == nSpanDonor - 1. The final value in both the donor and target arrays are 1D values, the values at nSpanDonor - 2 are the values at the shroud. I think the original proposed fix is still needed here as further down you have an array accessing a postion of kSpan + 1 which iirc resulted in the memory access violation. I think what has happened here is the Aachen case triggers this condition, and results in an error in the calculation of the coefficient for the linear interpolation. When we fixed it, we changed the simulation enough that it throws the residuals off as the computational problem is inherently different. A good test for this would be to extract the values at the interface from the turbomachinery special output with and without this change to see if this is the case. When we were first designing this case @alecappiello had an issue getting the results in one of the zones to agree with experimental results, this could be why?

joshkellyjak avatar May 28 '24 09:05 joshkellyjak

RESTART_FILENAME and SOLUTION_FILENAME are the same and this test is run both in serial and parallel_regression.py That means that whichever test runs first overwrites the restart file used by the other.

pcarruscag avatar Jun 09 '24 18:06 pcarruscag

Thanks @joshkellyjak and @pcarruscag for your remarks!

jblueh avatar Jun 27 '24 18:06 jblueh