[performance] GraalPython is slow when running Cython
I've been working on getting GraalPython tested on the Cython CI. It mostly works but it's really slow.
One aspect of this is the time spent running Cython itself. Note that this is pure Python code, so it doesn't involve any interaction with your C API emulation (which I know isn't considered a fast path). While Cython has the option of compiling itself for speed, I haven't done so here, for the sake of the report.
For the sake of a demo, I've just checked out the cython repository from GitHub and run
time python cython.py Cython/Compiler/*.py
That just runs Cython on a bunch of its own files (but only to the C code generation stage; it doesn't invoke any C compiler).
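For completeness, the full set of steps is roughly the following (the repository URL is the usual upstream one; see the note about pip install further down):

git clone https://github.com/cython/cython
cd cython
time python cython.py Cython/Compiler/*.py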
Some results:
Python 3.11.9
-----------
real 1m3.896s
user 0m55.934s
sys 0m4.580s
GraalPython (from the file "graalpy-24.0.2-linux-amd64.tar.gz" from your releases page)
Python 3.10.13 (Thu Jul 04 12:42:45 UTC 2024)
[Graal, Oracle GraalVM, Java 22.0.2] on linux
--------------
real 8m2.008s
user 21m20.609s
sys 0m19.100s
PyPy (pypy3.10-v7.3.12-linux64)
---------------------------------------------
real 4m18.502s
user 4m10.389s
sys 0m0.938s
The upshot is that GraalPython is about 8 times slower than CPython (and it also uses 3 cores of my CPU for most of that time, while CPython is largely single-threaded).
I've included PyPy just as another data point. It's also slower for this case (although not quite as slow as GraalPython), so we're clearly doing something that isn't JIT friendly...
I haven't done any profiling beyond this basic measurement (yet).
I do realise this is essentially an enormous code-dump with the complaint "it's slow", which is never a style of bug report that I'm very impressed with when I'm on the receiving end.
Profiling didn't reveal too much. It's spending a large chunk of time in _visitchildren in TreeVisitor in Visitor.py, but that's not unexpected.
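For context, _visitchildren is essentially a generic dispatch loop over each node's child_attrs. A heavily simplified sketch of the shape of that loop (illustrative only, not the actual Cython code):

def visit_children(visitor, node):
    # Each node lists the attributes that hold its children,
    # e.g. ['operand1', 'operand2'].
    for attr in node.child_attrs:
        child = getattr(node, attr)
        if child is None:
            continue
        if isinstance(child, list):
            for item in child:
                visitor.visit(item)
        else:
            visitor.visit(child)

So child_attrs gets read for every node visited, which at least makes it plausible that its definition matters.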
There's one place where we use

child_attrs = property(fget=operator.attrgetter('subexprs'))
# instead of
# @property
# def child_attrs(self):
#     return self.subexprs

Changing that made things a bit faster, but not dramatically so. And that's as far as I got.
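A minimal, self-contained sketch of that change, in case it helps with reproducing the effect (the class names and attribute values here are illustrative, not the actual Cython classes):

import operator
import timeit

class NodeAttrgetter:
    subexprs = ['operand1', 'operand2']
    # child_attrs goes through operator.attrgetter on every access
    child_attrs = property(fget=operator.attrgetter('subexprs'))

class NodePlainProperty:
    subexprs = ['operand1', 'operand2']

    @property
    def child_attrs(self):
        # plain property, no attrgetter indirection
        return self.subexprs

a, b = NodeAttrgetter(), NodePlainProperty()
print('attrgetter property:', timeit.timeit(lambda: a.child_attrs))
print('plain property:     ', timeit.timeit(lambda: b.child_attrs))

On CPython the two are close; the question is whether the attrgetter indirection is harder for GraalPy to optimize.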
GraalVM seems to have an option --cpusampler to produce profiles, including flame graphs. Maybe that can bring up some hints?
https://www.graalvm.org/latest/tools/profiling/
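Presumably something along these lines, with the option going to the launcher (taking the option name from the docs linked above):

graalpy --cpusampler cython.py Cython/Compiler/*.py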
> GraalVM seems to have an option --cpusampler to produce profiles, including flame graphs. Maybe that can bring up some hints?
Yes, I gave those a quick go - they were what pointed out operator.attrgetter. That was the only thing that really stood out as unexpected. I've attached some example output, though.
I've improved things on our CI by turning off the JIT with the options --experimental-options --engine.Compilation=false, which seems to make things both faster and single-core.
But we're clearly doing something that doesn't agree with how GraalPython optimizes things.
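For reference, passing those options to the launcher looks like this (using the same demo command as above):

graalpy --experimental-options --engine.Compilation=false cython.py Cython/Compiler/*.py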
If turning off the JIT helps, then it sounds like a deoptimization loop bug (in graalpy). You're most likely doing nothing wrong (unless you're constantly generating new code and evaling it). I'll try to investigate.
Thanks. I don't think it's eval/exec - we use them but very infrequently and the parts they're in don't show up on the profile.
Quick warning - if you do pip install cython I think it will compile itself. This report is just about running it without compiling it. That's easiest to get just by cloning the git repo, but NO_CYTHON_COMPILE=true pip install cython also works.
> if you do pip install cython I think it will compile itself
It should actually use the Python-any wheel that we distribute on PyPI, i.e. not try to build anything locally.