
OpenMP in Pyclaw

weslowrie opened this issue Oct 16 '15 • 34 comments

I have updated step3ds.f90 in PyClaw to be able to use OpenMP. Some small changes were necessary in classic/solver.py to read the OMP_NUM_THREADS environment variable and to properly set the size of the work array. The call to step3ds() was also changed slightly so that there are no assumed-size arrays (f2py does not like these); nthreads is passed in instead. Additionally, classic/setup.py was modified to explicitly add the '-fopenmp' f90 compiler flag and to link the 'gomp' library. I'm not sure if this is the best way to handle this, but it does work.
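Roughly, the Python side of such a change might look like the sketch below; work_per_thread and the commented-out call are placeholders rather than PyClaw's actual code:

import os

# Read the thread count from the environment, defaulting to 1 when unset.
nthreads = int(os.environ.get('OMP_NUM_THREADS', 1))

# Each thread needs its own slice of scratch space, so the work array
# grows with nthreads; work_per_thread stands in for the real formula.
work_per_thread = 1000
mwork = nthreads * work_per_thread

# nthreads is then passed explicitly to the Fortran routine instead of
# relying on assumed-size arrays, which f2py cannot handle:
# q, cfl = classic3.step3ds(..., nthreads, ...)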

I have not added the OpenMP directives to step3.f90, but doing so should be mostly trivial at this point.

weslowrie · Oct 16 '15

We should modify the Travis test case so that it exercises this code (perhaps set OMP_NUM_THREADS=2). I am also not entirely comfortable with having to use the environment variable directly rather than leaving it up to the system. I noticed that the number of threads is being used to allocate work space, but I am not sure I completely understand why OpenMP cannot do this for us.

mandli · Oct 17 '15

The line with int(os.environ.get('OMP_NUM_THREADS', 1)) gets the value set by the environment variable, or defaults to 1 if it is not set. It would be nice if OpenMP could handle the work array sizing, as well as the extra dimension on aux1, qadd, etc. You may have noticed there is a slight hack to get f2py to not assume that nthreads is optional.

weslowrie · Oct 17 '15

Yeah, f2py is a bit annoying like that. In any case, when an array enters a parallel region and execution forks to the threads, each thread should get its own copy of anything on the stack unless otherwise specified (via a shared clause). That said, if these threads are being forked and joined often, it might be better to create persistent storage. I wonder if, instead of using the environment variable, there is some way to fetch the desired number of threads.

mandli · Oct 17 '15

@mandli

I wonder if instead of using the environment variable that there is some way to fetch the desired number of threads.

Are you wondering whether we can query the system to find an optimal number of threads? Or are you thinking of setting the number of threads via program input rather than using the environment variable?

We could set input on the Python side with something like this: nthreads = os.environ["OMP_NUM_THREADS"] = "2"

weslowrie · Oct 19 '15

There are a number of ways to set the number of threads that OpenMP can use beyond just the environment variable. In the OpenMP API there is a way to query this but I am not sure how to do this in Python.

mandli · Oct 19 '15

I did a bit of looking into this but could not find a good way to call omp_get_max_threads, which is what I was thinking of. I do have a question, though: how would this interact with the process-based parallelism? Is there a way to spawn, say, 16 threads on a node and have the PETSc library handle inter-node communication?
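One option, if we want to stay in Python, might be to load the OpenMP runtime with ctypes and call omp_get_max_threads directly. This is only a sketch, and the runtime library name (gomp for GCC, omp for LLVM, iomp5 for Intel) varies by toolchain:

import ctypes
import ctypes.util

def max_omp_threads(default=1):
    # Try the common OpenMP runtime libraries in turn.
    for name in ("gomp", "omp", "iomp5"):
        path = ctypes.util.find_library(name)
        if path is None:
            continue
        try:
            lib = ctypes.CDLL(path)
            return int(lib.omp_get_max_threads())
        except (OSError, AttributeError):
            continue
    # Fall back if no OpenMP runtime could be found.
    return default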

mandli · Oct 24 '15

This is great -- thanks, @weslowrie! My main question is: does this cost us anything when running without OpenMP? I see two potential concerns:

  • Does compiling the code with -fopenmp make anything slower?
  • Is it at all probable that users will try to compile the code on a system without OpenMP and thus get a compilation failure? Or does installation of gcc/gfortran always include OpenMP?

@mandli I guess the first thing to do is just try a PETSc+OpenMP run. @weslowrie have you tried that already by any chance?

ketch · Oct 25 '15

Does compiling the code with -fopenmp make anything slower? Is it at all probable that users will try to compile the code on a system without OpenMP and thus get compilation failure? Or does installation of gcc/gfortran always include OpenMP?

I'm not sure about either of these; it is probably worth some tests. I think it is possible to NOT include the 'gomp' library while still having gcc, so some systems might not have the proper OpenMP library.

I recently noticed that the self.nthreads = int(os.environ.get('OMP_NUM_THREADS',1)) line is not working as intended when the environment variable is not set. I'll look into this.

@mandli I guess the first thing to do is just try a PETSc+OpenMP run. @weslowrie have you tried that already by any chance?

I have not tried the MPI+OpenMP combo yet. I might be able to run a test on a machine with multiple nodes to see if it works. I'll let you know if I am able to successfully test this.

weslowrie · Oct 26 '15

I'm trying to provide a default behavior for when the OpenMP environment variable OMP_NUM_THREADS is unset. The problem is that the line in step3ds(), !$ nthreads = omp_get_num_threads(), sets nthreads to the number of processors present on the system when the variable is unset. This contradicts the Python side, which sets self.nthreads=1 when it is unset, so the mwork array is sized incorrectly.

I'm not sure how to handle this yet. A child process cannot change the shell's environment variable, which is what OpenMP looks at. I also did not see an obvious way to override OpenMP's decision when omp_get_num_threads() is called. Does anyone see an acceptable way to handle this?

weslowrie · Oct 26 '15

omp_get_num_threads should return the current number of threads regardless of what OMP_NUM_THREADS is set to. omp_get_max_threads, when OMP_NUM_THREADS is unset, defaults to the maximum number of threads the system allows. This was one of the reasons I was a bit concerned about using OMP_NUM_THREADS in Python. Maybe we should just require that OMP_NUM_THREADS is set, or create a simple Fortran subroutine that asks OpenMP how many threads are available and call it from Python.

mandli · Oct 26 '15

@weslowrie Some testing is being done now by @aymkhalil -- we're going to see if we can run with MPI+OpenMP on the new supercomputer here, and if it helps over just vanilla MPI.

I'm pretty sure we'll want to merge this in, so could you go ahead and add the OMP modifications to step3.f90 too?

ketch · Oct 28 '15

@ketch Yes no problem. I'll update step3.f90 as well.

We should find a resolution for the case where OMP_NUM_THREADS is unset or there is no OpenMP on the system. One possible easy solution that @mandli suggested is to just check for the OMP_NUM_THREADS environment variable and exit the program if it is unset. Or would we want the default to be a run with the maximum possible number of threads?

Also, in the Python setup.py, do we need a way to skip the -fopenmp and gomp compile flags if OpenMP is not present on the system?

weslowrie · Oct 28 '15

Can you just use the

#if defined(_OPENMP)
    #pragma omp ...
#endif

pre-processing macro? This is defined by the OpenMP standard. See

http://stackoverflow.com/questions/1300180/ignore-openmp-on-machine-that-doesnt-have-it

and

http://bisqwit.iki.fi/story/howto/openmp/#Discussion

These are C/C++ macros, but I would imagine that something similar is available in Fortran.

donnacalhoun · Oct 28 '15

It is, but I think @weslowrie needs it in Python, which we have not been able to figure out beyond running a call through C or Fortran.

mandli · Oct 28 '15

I have added a modified step3.f90 that includes OpenMP, as done in classic Clawpack. I tested this with euler_3d's Sedov.py and the regression test within that folder; OpenMP appears to be working properly. Also, as a possibly temporary fix, I used the Python multiprocessing module to check the number of CPUs and used that to set the default number of threads, which is consistent with what omp_get_num_threads() returns when the environment variable is unset. This may not work on all systems, and a more rigorous solution probably needs to be implemented.
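For reference, that temporary fix amounts to something like the line below; whether it belongs in solver.py or elsewhere is an implementation detail:

import multiprocessing
import os

# Default to the machine's CPU count when OMP_NUM_THREADS is unset, which is
# consistent with the thread count the OpenMP runtime was observed to use.
nthreads = int(os.environ.get('OMP_NUM_THREADS', multiprocessing.cpu_count()))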

weslowrie · Oct 29 '15

Good solution!

mandli · Oct 29 '15

I may be wrong about this, but if someone were to not set OMP_NUM_THREADS=1 and not compile with the OpenMP library, would they incur a memory footprint penalty of up to the number of cores they have, at least for the aux arrays?

mandli · Nov 04 '15

@mandli Yes, you are correct, and it is a problem. Not only are the aux arrays larger, but the work array, which is sized in _allocate_workspace() in solver.py, is larger as well. This would occur even if they did not compile with the OpenMP library. I'll have to think of a better solution.

How does the PyClaw setup typically deal with custom compiler flags? Should OpenMP be the default, or should we expect the user to set custom flags while compiling/installing PyClaw?

weslowrie · Nov 04 '15

The flags are usually in the setup.py files in the relevant src directories.

Interesting question about the default, though: should OpenMP be enabled and use all of the cores on a machine if someone did not set OMP_NUM_THREADS? It seems we may want to require OMP_NUM_THREADS to be set in order to use this, but I could be convinced otherwise.

mandli · Nov 04 '15

@mandli: Don't AMRClaw and GeoClaw use OpenMP now? How are these issues handled there?

ketch · Nov 05 '15

I made some changes, and I think they give a desirable default behavior. The Python code checks whether the OMP_NUM_THREADS variable is set; if not, it gives a warning and sets the default array sizes with self.nthreads=1. The Fortran side also checks whether the environment variable is set; if not, it forces the OpenMP thread count to 1 with !$ call omp_set_num_threads(1). Otherwise it continues as before and reads the environment variable.

This way an unsuspecting user will only see a warning about the unset environment variable, and can set it if they want. If the variable is set, the calculation simply runs with that number of threads. Either way, we don't have any arrays that are larger than necessary.

A possible last fix, and I don't know how to do this: check whether the OpenMP compiler flags and libraries were used when compiling PyClaw, and if so suppress the OMP_NUM_THREADS warning that was just added to solver.py.
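In outline, the Python side of that default behavior might look like the following sketch (the warning text and variable names are placeholders):

import os
import warnings

if 'OMP_NUM_THREADS' in os.environ:
    nthreads = int(os.environ['OMP_NUM_THREADS'])
else:
    # Warn and size everything for one thread; the Fortran side separately
    # forces a single thread in this case, so the work-array sizes and the
    # actual thread count stay consistent.
    warnings.warn("OMP_NUM_THREADS is not set; defaulting to a single thread.")
    nthreads = 1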

weslowrie · Nov 05 '15

@ketch By default OpenMP is not included (in fact we do not specify any flags; it uses the compiler's defaults). We have gone back and forth about what to do for this but have come to no definitive conclusions (we really should do something like set an OPT flag).

@weslowrie Is this warning visible in general given default logging levels? Otherwise this sounds reasonable.

mandli · Nov 05 '15

@mandli I'm not sure about the default logging levels. Do I need to do anything more than add a line like this: warnings.warn("...") for default logging?

weslowrie · Nov 05 '15

I just cannot remember whether warnings are printed to the console or only to the log file when using the standard PyClaw loggers. Speaking of that, you might instead want to use the logger attached to the solver. You can use it with

self.logger.warning(warning_stuff)

mandli · Nov 05 '15

@mandli I modified it to use the PyClaw logger as suggested. With this method it writes the warning to both the console and the log file. I did have to move the code below the super(ClawSolver3D,self).__init__(riemann_solver, claw_package) line, because otherwise the logger had not been instantiated yet.
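A stripped-down illustration of that ordering constraint, with placeholder class names standing in for PyClaw's solver hierarchy:

import logging
import os

class BaseSolver(object):
    def __init__(self):
        # In PyClaw, the base solver's __init__ is what creates self.logger.
        self.logger = logging.getLogger('pyclaw.solver')

class Solver3D(BaseSolver):
    def __init__(self):
        super(Solver3D, self).__init__()
        # This check must come after the super() call; before it,
        # self.logger does not exist yet.
        if 'OMP_NUM_THREADS' not in os.environ:
            self.logger.warning("OMP_NUM_THREADS is not set; using 1 thread.")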

weslowrie · Nov 06 '15

I would perhaps send the message at a level that is not printed to the console, so as not to worry users who are not concerned with such things. It should still go to the log file, though, so astute users can look there.

mandli · Nov 06 '15

@mandli Do you mean setting it to the INFO or DEBUG logger level? I'm not sure why, but if the DEBUG level is used, the message is written neither to the console nor to the file.

weslowrie · Nov 06 '15

@mandli I guess the first thing to do is just try a PETSc+OpenMP run. @weslowrie have you tried that already by any chance?

I tried a PETSc + OpenMP run, and superficially it looks like it is working well. I did not time the runs, but there was a noticeable speedup. I ran on 4 nodes (4 MPI jobs), with 4 OpenMP threads per node. The machine I used has 4 CPUs per node.

weslowrie · Nov 06 '15

Yes to the logger level. Which one is not being sent to the logfile?

mandli · Nov 06 '15

If I modify the code to self.logger.debug(message), the message is sent to neither the console nor the logfile, probably due to the default level set for the solver logging. I don't think I should modify the solver logging levels here.
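That matches how Python's standard logging module behaves: messages below a logger's effective level are filtered out before any handler (console or file) sees them. A minimal demonstration, with the logger name and level chosen only as an example:

import logging

logging.basicConfig()                      # attach a console handler
logger = logging.getLogger('pyclaw.solver')
logger.setLevel(logging.INFO)

logger.debug("dropped: DEBUG is below the INFO threshold")
logger.info("emitted: INFO meets the threshold")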

weslowrie · Nov 06 '15