idefix icon indicating copy to clipboard operation
idefix copied to clipboard

BUG: Profiling misattributing CPU time in recursive calls

Open francois-rincon opened this issue 9 months ago • 4 comments

Describe the issue:

Hello, after a few hours of hair pulling trying to profile my code and thinking I was spending too much time in a recursive call, which I apparently wasn't, I think I found the problem, and it has to do with using profiling in recursive calls.

When you initiate recursive calls (like Loop<IDIR>(t,dt)) you can then have several idfx::pushRegion called consecutively by subfunctions like calcflux and calcrhs (if you have pushed/popped them in their code) and then idfx:popRegion() is called several times within that recursion without higher-level pushRegion in the recursion being closed first by their own popRegion().

I suspect this is messing up time accounting in the profiler... I have a code with EvolveStage() idfx::pushRegion("Evolvestage") Do main stuff Loop<IDIR> idfx::popRegion()

but in Loop I have calls to calcflux and calcrhs which themselves look like

calcflux() idfx::pushRegion("Calcflux") Do stuff idfx::popRegion()

so the pushRegion("Evolvestage") is not closed by a pop before popRegion() is called in Calcflux (and it gets worse because then loop is called recursively on direction and so is calcflux, before the popRegion of EvolveStage is called).

In this example I noticed that in the profiling output a great deal of the CPU time actually spent in the main part of EvolveStage was attributed to the function (CalcFlux) called in Loop<IDIR> . So I think the popRegion from that CalcFlux increments the time counter for the time spent in this function not with the time really spent in this function but with the time elapsed since the main pushRegion in EvolveStage.

Or something like that...it's very clear in any case that the profiling time accounting in this case is doing weird things.

when I do a pushRegion in the main of EvolveStage followed by a popRegion() BEFORE Loop<IDIR> is called and then a sequential push/popRegion enclosing Loop<IDIR> , also removing the push/pop in calcflux, I do get the right timings. So it seems like it is the nesting of pushRegion and popRegion is causing trouble here...

N.B.: I have been careful to pass the right function names to the pushRegion calls, so it's not a problem of incrementing one counter with the timing of several functions.

I haven't looked at the entrails of the mechanics of profilings and of pushRegion/popRegion but I think what I see kind of makes senses considering popRegion does not take as an argument the function that it times, so is maybe blind to the specific pushRegion time information it needs to take into account (and apparently it takes that from the very first push in the nesting).

not sure if it's expected behaviour, and if one is supposed to always call the profiling pushRegion/popRegion functions sequentially ? This is not how we coded things for the PATOU fork in the Ecology::EvolveStage at least, and a quick look suggests it it also how things are done inside Fluid<Phys>::EvolveStage with LoopDir and the inner call to calcflux of the Riemann solver

Does this make sense ?

Error message:


runtime information:

Patou merged to recent master of idefix

          Idefix version 2.2.01-d5257c4
          Built against Kokkos 40301
          Compiled on May  1 2025 at 21:14:47

N.B.: all observations for a GPU run, haven't tested on CPUs

francois-rincon avatar May 02 '25 08:05 francois-rincon

Hi François,

So yeah, the timing provided by Idefix between the pushRegion and the popRegion is "all included", meaning that in your example the EvolveStage region timing will include the time spent in the function EvolveStage plus all of the subroutines called within that function (that includes the recursion on <dir>). So it's not the "self" timing, but the all inclusive timing that is provided by idefix.

The way the pop/push regions are handled is like a usual computer stack (hence the name of the function): it's a FIFO strategy. In principle it could be possible to infer the time spent in self by doing some arithmetics (actually the original kokkos-tools profiling does exactly that). I've never encountered any situation where this was actually needed...

Now, I'm not sure this is a bug (or if it is, I don't understand why). Maybe the documentation should be more explicit?

glesur avatar May 05 '25 07:05 glesur

Not sure we are on the same page... My problem seems to be different: it is that when I time the CalcFlux within the Loop, through a push/pop inside this function, on top of push/pop in EvolveStage, then then computing time spent outside of CalcFlux, in EvolveStage before calling the Loop and Calcflux, is attributed to CalcFlux (not EvolveStage). And because I do 5D loops on my interaction matrices (which complexity goes like Ns^2 NxNyNz, where N is my number of species) in EvolvesStage, before calculating the Loop and Calcfluxes/CalcRhs for advection and diffusion, this time is actually significantly bigger (and gets much bigger when I increase Ns) than the time actually spent in CalcFlux. I was wondering why so much time was being attributed to CalcFlux while it only had complexity Ns * Nx* Ny * Nz, and figured out this is because that time accounted for the time spent before calling CalcFlux, in EvolveStage.

When I do sequentially

pushRegion loop on 5D interaction matrices popRegion pushRegion Loop --> CalcFlux popRegion

(or with the second push/pop in CalcFlux or Loop directly) I do get the right timing for each bit. So I think there is an issue when you nest the push / pop.

I think you may not have noticed this earlier because you don't have this kind of situation in the standard fluid EvolveStage. It is the added Ns^2 complexity of the interaction loop in EvolveStage, before calling CalcFlux, that made me realize there may be an issue with the timing here.

Does this clarify the issue ?

francois-rincon avatar May 05 '25 12:05 francois-rincon

ok, I see. You might be facing an out of sync issue since you're doing a very large loop that is not finished on the GPU side, when the CPU entrers the CalcFlux region.

Can you try to add Kokkos::fence(); in the pushRegion function and see if this fixed your issue ? (here: https://github.com/idefix-code/idefix/blob/ede2e1f5e30e4486bd1dfcf141b4a67126634093/src/global.cpp#L66 )

glesur avatar May 05 '25 12:05 glesur

will try !

francois-rincon avatar May 05 '25 12:05 francois-rincon

Sorry it took me a bit of time to get back to this as other devs took priority. Just checked that it fixes the issue (also, the problem only happened on GPUs). With fence the timing is correct, as most of the time in EvolveStage is spent in big 5D loops before the subcall to CalcFlux.

*Without Kokkos::fence() in pushRegion() Image

*With Kokkos::fence() in pushRegion() Image

francois-rincon avatar May 16 '25 13:05 francois-rincon

Fixed in #341

glesur avatar May 16 '25 16:05 glesur