[performance] GraalPython is slow when running Cython
I've been working on getting GraalPython tested on the Cython CI. It mostly works but it's really slow.
One aspect of this is the time spent running Cython itself. Note that this is pure Python code, so it doesn't involve any interaction with your C API emulation (which I know isn't considered a fast path). While Cython has the option of compiling itself for speed, I haven't done so here, for the sake of the report.
For the sake of a demo, I've just checked out the cython repository from GitHub and run
time python cython.py Cython/Compiler/*.py
That just runs Cython on a bunch of its own files (but only to the C code generation stage; it doesn't invoke any C compiler).
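For completeness, the full set of steps is roughly the following (the repository URL is the usual upstream one; see the note about pip install further down):

git clone https://github.com/cython/cython
cd cython
time python cython.py Cython/Compiler/*.py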
Some results:
Python 3.11.9
-----------
real 1m3.896s
user 0m55.934s
sys 0m4.580s
GraalPython (from the file "graalpy-24.0.2-linux-amd64.tar.gz" from your releases page)
Python 3.10.13 (Thu Jul 04 12:42:45 UTC 2024)
[Graal, Oracle GraalVM, Java 22.0.2] on linux
--------------
real 8m2.008s
user 21m20.609s
sys 0m19.100s
PyPy (pypy3.10-v7.3.12-linux64)
---------------------------------------------
real 4m18.502s
user 4m10.389s
sys 0m0.938s
The upshot is that GraalPython is about 8 times slower than CPython (and it also uses 3 cores of my CPU for most of that time, while CPython is largely single-threaded).
I've included PyPy just as another data point. It's also slower for this case (although not quite as slow as GraalPython), so we're clearly doing something that isn't JIT friendly...
I haven't done any profiling beyond this basic measurement (yet).
I do realise this is essentially an enormous code-dump with the complaint "it's slow", which is never a style of bug report that I'm very impressed with when I'm on the receiving end.
Profiling didn't reveal too much. It's spending a large chunk of time in _visitchildren in TreeVisitor in Visitor.py, but that's not unexpected.
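For context, _visitchildren is essentially a generic dispatch loop over each node's child_attrs. A heavily simplified sketch of the shape of that loop (illustrative only, not the actual Cython code):

def visit_children(visitor, node):
    # Each node lists the attributes that hold its children,
    # e.g. ['operand1', 'operand2'].
    for attr in node.child_attrs:
        child = getattr(node, attr)
        if child is None:
            continue
        if isinstance(child, list):
            for item in child:
                visitor.visit(item)
        else:
            visitor.visit(child)

So child_attrs gets read for every node visited, which at least makes it plausible that its definition matters.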
There's one place where we use

child_attrs = property(fget=operator.attrgetter('subexprs'))
# instead of
# @property
# def child_attrs(self):
#     return self.subexprs

Changing that made things a bit faster, but not dramatically so. And that's as far as I got.
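A minimal, self-contained sketch of that change, in case it helps with reproducing the effect (the class names and attribute values here are illustrative, not the actual Cython classes):

import operator
import timeit

class NodeAttrgetter:
    subexprs = ['operand1', 'operand2']
    # child_attrs goes through operator.attrgetter on every access
    child_attrs = property(fget=operator.attrgetter('subexprs'))

class NodePlainProperty:
    subexprs = ['operand1', 'operand2']

    @property
    def child_attrs(self):
        # plain property, no attrgetter indirection
        return self.subexprs

a, b = NodeAttrgetter(), NodePlainProperty()
print('attrgetter property:', timeit.timeit(lambda: a.child_attrs))
print('plain property:     ', timeit.timeit(lambda: b.child_attrs))

On CPython the two are close; the question is whether the attrgetter indirection is harder for GraalPy to optimize.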
GraalVM seems to have an option --cpusampler to produce profiles, including flame graphs. Maybe that can bring up some hints?
https://www.graalvm.org/latest/tools/profiling/
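Presumably something along these lines, with the option going to the launcher (taking the option name from the docs linked above):

graalpy --cpusampler cython.py Cython/Compiler/*.py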
> GraalVM seems to have an option --cpusampler to produce profiles, including flame graphs. Maybe that can bring up some hints?
Yes, I gave those a quick go - they were what pointed out operator.attrgetter. That was the only thing that really stood out as unexpected. I've attached some example output, though.
I've improved things on our CI by turning off the JIT with the options --experimental-options --engine.Compilation=false, which seems to make things both faster and single-core.
But we're clearly doing something that doesn't agree with how GraalPython optimizes things.
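For reference, passing those options to the launcher looks like this (using the same demo command as above):

graalpy --experimental-options --engine.Compilation=false cython.py Cython/Compiler/*.py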
If turning off the JIT helps, then it sounds like a deoptimization loop bug (in graalpy). You're most likely doing nothing wrong (unless you're constantly generating new code and evaling it). I'll try to investigate.
Thanks. I don't think it's eval/exec - we use them but very infrequently and the parts they're in don't show up on the profile.
Quick warning - if you do pip install cython I think it will compile itself. This report is just about running it without compiling it. That's easiest to get just by cloning the git repo, but NO_CYTHON_COMPILE=true pip install cython also works.
> if you do pip install cython I think it will compile itself
It should actually use the Python-any wheel that we distribute on PyPI, i.e. not try to build anything locally.