`build program` times increasing with rank count on Mac when caching is enabled
The times reported by the build program: kernel '<name>' was part of a lengthy source build resulting from a binary cache miss (<time>) output appear to increase fairly dramatically with rank count on my Mac with caching enabled, even when using rank-local cache directories. For example, when running the wave-op-mpi example in grudge, with 16 ranks and caching disabled via PYOPENCL_NO_CACHE=1, I see:
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.31 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.31 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.32 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.32 s)
With caching enabled (and empty cache) I see:
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (0.49 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (0.68 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.94 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (3.28 s)
(Note: rhs is missing from the first output, presumably because the time is below the output threshold. The lack of frozen_nodes0_2d in the second output is confusing though.)
If I increase to 16 ranks, with no caching I see:
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.52 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.54 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.56 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.55 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.55 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.56 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.53 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.56 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.51 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.53 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.56 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.53 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.51 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.52 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.59 s)
INFO:pyopencl:build program: kernel 'frozen_nodes0_2d' was part of a lengthy uncached source build (cache disabled by user) (0.61 s)
(again no rhs). And with caching I see:
... truncated ...
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.07 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.14 s)
INFO:pyopencl:build program: kernel 'frozen_result' was part of a lengthy source build resulting from a binary cache miss (1.18 s)
INFO:pyopencl:build program: kernel 'frozen_result' was part of a lengthy source build resulting from a binary cache miss (1.11 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.13 s)
INFO:pyopencl:build program: kernel 'frozen_result' was part of a lengthy source build resulting from a binary cache miss (1.19 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.21 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.22 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.18 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (4.17 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (3.31 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (4.31 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.20 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (3.27 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.22 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (4.19 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.84 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (2.82 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (3.13 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (9.33 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (10.46 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (10.64 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (11.14 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (11.09 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (11.25 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (11.41 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (12.45 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (12.55 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (13.31 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (12.81 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (12.98 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (14.04 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (14.83 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (14.18 s)
INFO:pyopencl:build program: kernel 'rhs' was part of a lengthy source build resulting from a binary cache miss (14.30 s)
... truncated ...
(full build program output here).
If I profile with pyinstrument, I see an increase in time spent in Program.build inside grudge's _DistributedCompiledFunction.__call__. Here's the profiling output without caching:
│ ├─ 38.014 _DistributedCompiledFunction.__call__ grudge/array_context.py:413
│ │ ├─ 23.718 execute_distributed_partition pytato/distributed/execute.py:103
│ │ │ ├─ 14.778 wait_for_some_recvs pytato/distributed/execute.py:185
│ │ │ │ ├─ 12.957 [self] pytato/distributed/execute.py
│ │ │ │ └─ 1.806 to_device pyopencl/array.py:2329
│ │ │ └─ 8.359 exec_ready_part pytato/distributed/execute.py:164
│ │ │ ├─ 4.575 BoundPyOpenCLExecutable.__call__ pytato/target/loopy/__init__.py:305
│ │ │ │ └─ 4.350 PyOpenCLExecutor.__call__ loopy/target/pyopencl_execution.py:349
│ │ │ │ ├─ 2.344 wrapper pytools/__init__.py:768
│ │ │ │ │ └─ 2.339 PyOpenCLExecutor.translation_unit_info loopy/target/pyopencl_execution.py:302
│ │ │ │ └─ 1.935 PicklableFunction.__call__ pytools/py_codegen.py:147
│ │ │ └─ 3.544 Array.get pyopencl/array.py:890
│ │ │ [4 frames hidden] pyopencl, <built-in>
and with:
│ ├─ 49.067 _DistributedCompiledFunction.__call__ grudge/array_context.py:413
│ │ ├─ 38.637 execute_distributed_partition pytato/distributed/execute.py:103
│ │ │ ├─ 20.316 exec_ready_part pytato/distributed/execute.py:164
│ │ │ │ ├─ 16.636 BoundPyOpenCLExecutable.__call__ pytato/target/loopy/__init__.py:305
│ │ │ │ │ └─ 16.391 PyOpenCLExecutor.__call__ loopy/target/pyopencl_execution.py:349
│ │ │ │ │ ├─ 14.543 wrapper pytools/__init__.py:768
│ │ │ │ │ │ └─ 14.537 PyOpenCLExecutor.translation_unit_info loopy/target/pyopencl_execution.py:302
│ │ │ │ │ │ └─ 12.608 Program.build pyopencl/__init__.py:505
│ │ │ │ │ │ [6 frames hidden] pyopencl, <built-in>
│ │ │ │ │ └─ 1.775 PicklableFunction.__call__ pytools/py_codegen.py:147
│ │ │ │ └─ 3.462 Array.get pyopencl/array.py:890
│ │ │ │ [4 frames hidden] pyopencl, <built-in>
│ │ │ └─ 17.774 wait_for_some_recvs pytato/distributed/execute.py:185
│ │ │ ├─ 15.945 [self] pytato/distributed/execute.py
│ │ │ └─ 1.806 to_device pyopencl/array.py:2329
│ │ └─ 8.677 _args_to_device_buffers arraycontext/impl/pytato/compile.py:524
│ │ ├─ 6.204 MPIFusionContractorArrayContext.freeze arraycontext/impl/pytato/__init__.py:429
│ │ │ └─ 6.112 PyCapsule.wait <built-in>
│ │ └─ 2.286 to_device pyopencl/array.py:2329
│ │ [4 frames hidden] pyopencl, <built-in>
Here's the script I'm using to run the example:
#!/bin/bash
if [[ -n "$OMPI_COMM_WORLD_NODE_RANK" ]]; then
# Open MPI
RANK_ID="rank${OMPI_COMM_WORLD_RANK}"
elif [[ -n "$MPI_LOCALRANKID" ]]; then
# mpich/mvapich
RANK_ID="rank${MPI_LOCALRANKID}"
fi
export POCL_CACHE_DIR=".cache/pocl_${RANK_ID}"
export XDG_CACHE_HOME=".cache/xdg_${RANK_ID}"
python -m mpi4py wave-op-mpi.py --lazy
# pyinstrument -o "pyinstrument/${RANK_ID}.txt" -m mpi4py wave-op-mpi.py --lazy
(run with rm -rf .cache && mpiexec -n 4 bash run.sh.)
I haven't been able to try running this on Lassen yet to see if I get the same behavior there; I'm currently running into some environment issues.
cc @matthiasdiener
Here's a breakdown of what's happening inside Program.build:
│ ├─ 11.333 Program.build pyopencl/__init__.py:505
│ │ └─ 11.333 Program._build_and_catch_errors pyopencl/__init__.py:554
│ │ └─ 11.333 <lambda> pyopencl/__init__.py:536
│ │ └─ 11.333 create_built_program_from_source_cached pyopencl/cache.py:489
│ │ └─ 11.333 _create_built_program_from_source_cached pyopencl/cache.py:341
│ │ ├─ 11.186 PyCapsule.get_info <built-in>
│ │ ├─ 0.145 _Program.program_build pyopencl/__init__.py:735
│ │ │ └─ 0.145 PyCapsule._build <built-in>
│ │ └─ 0.001 retrieve_from_cache pyopencl/cache.py:265
│ │ └─ 0.001 isdir <frozen genericpath>:39
│ │ └─ 0.001 stat <built-in>
The slowdown appears to be coming from these calls. Timing the two separately, it looks like the second one specifically is to blame.
I think those get_info calls just trigger the actual build downstream (ie., pocl). Do they not show up in the uncached build (maybe in a different spot)?
Based on @matthiasdiener's comment and our discussion this morning, I made some more measurements, this time on the whole compile time. Specifically, I compared the first step time of grudge wave for:
- Caching enabled (i.e., not setting
PYOPENCL_NO_CACHE). This is the path that callscreate_built_program_from_source_cachedand reads/writes cache. Note: For this test I disabled cache reading to simulate a completely cold cache (and eliminate cache reads resulting from cache writes in the same execution, which somehow does seem to happen). - Caching disabled (setting
PYOPENCL_NO_CACHE=1). This path just callsprg.build(...)directly.
If I understand correctly, the main time difference between these should come down to the cache writing time. Here's what I see (same setup as before, with rank-local cache dirs; also, I am manually applying the changes from #716, which don't seem to have made it to the version on conda yet):
The scaling is not good, but could be due to DAG splat. Additionally, it seems as if the cache writing is taking a lot of time. However, if I add a (unused) call to get_info(BINARIES) in the non-cache version I see this:
which suggests that most of the time is coming from the get_info call, not the actual cache writing. Does this make sense? Is get_info(BINARIES) doing something inefficient?
Is
get_info(BINARIES)doing something inefficient?
It sure looks that way. It might require duplicate compilation in pocl? (I'm not sure where, but your second graph is enough for me.) Based on this, I think we should definitely turn off pyopencl's CL binary caching for pocl. PR?
It might also be worthwhile to understand what pocl is doing under the hood.
I think what happens is the following:
Example pyopencl code:
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
rng = np.random.default_rng()
a = rng.random(50000, dtype=np.float32)
b = rng.random(50000, dtype=np.float32)
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
a_dev = cl_array.to_device(queue, a)
b_dev = cl_array.to_device(queue, b)
dest_dev = cl_array.empty_like(a_dev)
prg = cl.Program(ctx, """
__kernel void sum(__global const float *a,
__global const float *b, __global float *c)
{
int gid = get_global_id(0);
c[gid] = a[gid] + b[gid];
""" + "c[gid] = a[gid] + b[gid];"*1000 + "}"
).build()
knl = prg.sum # Use this Kernel object for repeated calls
knl(queue, a.shape, None, a_dev.data, b_dev.data, dest_dev.data)
assert np.allclose(dest_dev.get(), a + b)
- Without
get_info(BINARIES)(i.e., cache disabled): only 1 kernel gets compiled by pocl (viapocl_llvm_codegen):./xdg-cache/pocl/kcache/CJ/MFPEIKHNIJCGFPPDEFBAJGFMDDFFEFBLDLAHM/sum/2000-1-1-goffs0-smallgrid/sum.so - When
get_info(BINARIES)(i.e., cache enabled) is called, it also compiles a generic version viapocl_driver_build_poclbinary./xdg-cache/pocl/kcache/CJ/MFPEIKHNIJCGFPPDEFBAJGFMDDFFEFBLDLAHM/sum/2000-1-1-goffs0-smallgrid/sum.so ./xdg-cache/pocl/kcache/CJ/MFPEIKHNIJCGFPPDEFBAJGFMDDFFEFBLDLAHM/sum/0-0-0/sum.so
I haven't found a way to disable this behavior.
Thanks for doing more digging here, @matthiasdiener! While we didn't decode that a "generic" kernel was being built, we did track down pocl_driver_build_poclbinary and concluded that it would likely trigger a compile and that, given @majosm's measurements, that compile was in addition to the "normal" from-source-for-execution build.
Important question: are all these conclusions still valid for the Nvidia target? They seem device-unspecific, but I don't know how a generic kernel would be different from a size-specific one in the GPU case.
At any rate, at least for CPU, we can probably save time by skipping pyopencl's binary cache if we're working with pocl.
Seems like the time spent in get_info(BINARIES) is much higher for CPUs than it is for GPUs. For combozzle on Lassen I'm seeing sub-millisecond times when running on the GPU, and up to 40s when running on the CPU.