Making pycodestyle faster
I've been on a Python performance optimization kick recently (see
https://github.com/PyCQA/astroid/pull/497), and I'm a pycodestyle
user, so I figured I would give it a look and see if its performance
could be improved at all.
Of course, pycodestyle is already pretty fast, so to give it some
stress I'm testing it out by pointing it right at a large repo, namely
Zulip (https://github.com/zulip/zulip). In particular, my test command
is time ~/pycodestyle/pycodestyle.py -qq ~/zulip.
Here are the times from three runs of master:
real 2m10.664s
user 0m55.333s
sys 1m15.249s

real 2m11.376s
user 0m55.545s
sys 1m15.745s

real 2m11.588s
user 0m55.500s
sys 1m16.044s
I used the yappi profiling library to see if there were any hotspots
in the code. There were. Take a look at the graph below. The brighter
the box, the hotter the spot. In more detail, each box represents a
function and has three numbers: 1) the percentage of total CPU time
spent in that function, 2) the percentage of total CPU time spent in
that function but not its subcalls, and 3) the number of times the
function was called.

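For anyone who wants to reproduce this kind of graph, a minimal yappi
run looks roughly like the sketch below; my actual harness and the
graph-rendering step may differ in the details.

import yappi
import pycodestyle

yappi.set_clock_type("cpu")  # measure CPU time rather than wall time
yappi.start()
pycodestyle.StyleGuide(quiet=True).check_files(["/path/to/zulip"])
yappi.stop()

# Save callgrind output, which tools like gprof2dot or KCachegrind can
# turn into a call graph.
yappi.get_func_stats().save("pycodestyle.callgrind", type="callgrind")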
The red box that sticks out is Checker.run_check. It is called well
over two million times, and 27.7% of total CPU time is spent there,
almost all of it in the function itself rather than in its subcalls.
This seems like an awful lot considering how short the function is:
def run_check(self, check, argument_names):
    """Run a check plugin."""
    arguments = []
    for name in argument_names:
        arguments.append(getattr(self, name))
    return check(*arguments)
So why does it suck up so much time?
I think I've worked out how it goes. When a check is registered (with
register_check), its argument names are extracted with the inspect
library and stored as a list of strings. When a check is run,
run_check iterates over that list of names, dynamically looks each one
up as an attribute of the Checker, and passes the resulting values to
the check.
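In simplified form, the registration side looks roughly like this (a
sketch of the pattern, not pycodestyle's exact code; the registry and
example parameter names are just for illustration):

import inspect

_checks = []  # hypothetical registry of (check, argument_names) pairs

def register_check(check):
    # Read the check's parameter names with inspect and keep them as
    # strings, e.g. ['logical_line', 'indent_level', 'noqa'].
    argument_names = list(inspect.signature(check).parameters)
    _checks.append((check, argument_names))
    return check

At run time, run_check turns each stored name back into a value with
getattr(self, name): one dynamic lookup per argument, for every check,
on every line being examined.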
The problem here is that dynamic attribute access is slow, and doing it in tight loops is really slow (see https://github.com/PyCQA/astroid/pull/497 for a harrowing cautionary tale on this subject). My idea was to see if there was a way to do away with the dynamic attribute access, basically by "compiling" the attribute access into the code.
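To get a rough feel for the per-access cost, here is a standalone
micro-benchmark (not from pycodestyle; the class and attribute name
are arbitrary):

import timeit

setup = "class C: pass\nc = C(); c.logical_line = 'x = 1'"

# Dynamic lookup by string name, the way run_check does it.
print(timeit.timeit("getattr(c, 'logical_line')", setup=setup))

# Direct attribute access, with the name "compiled" into the code.
print(timeit.timeit("c.logical_line", setup=setup))

The difference per call is small, but run_check is doing this millions
of times, so it shows up in the profile.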
It turns out that this can be accomplished by passing the checker
instance into the check as an argument and accessing the attributes
directly on the checker. Implementing this change involves a
large-scale shuffling of arguments and strings, but beyond that not
much changes. register_check now has to take the check's argument
names explicitly, since they no longer appear in the check's actual
signature. run_check itself can be done away with, since all it would
do is call the check with the checker as its argument, and that can be
done inline.
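Continuing the earlier sketch, the reworked shape looks something like
this (again simplified and hypothetical, not the actual patch;
example_check and the checker attributes it reads are illustrative):

def register_check(check, argument_names):
    # The names are passed in explicitly now, because the check's real
    # signature is just (checker,) and no longer lists them.
    _checks.append((check, argument_names))
    return check

def example_check(checker):
    # A toy check: it reads what it needs directly off the checker,
    # with no per-name getattr.
    line = checker.physical_line
    if len(line.rstrip()) > checker.max_line_length:
        return checker.max_line_length, "E501 line too long"

With this shape, the body of run_check collapses to a plain call like
check(self), which can simply be inlined at its call sites.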
This change resulted in a substantial speedup:
real 1m28.057s
user 0m40.340s
sys 0m47.669s

real 1m27.843s
user 0m39.910s
sys 0m47.901s

real 1m28.258s
user 0m40.379s
sys 0m47.849s
Here is the resulting yappi graph:

This graph is a lot more colorful than the last one. This means that the work is spread out more evenly among the various functions and there isn't one overwhelmingly critical hotspot.
One function that stuck out to me was Checker.init_checker_state.
After some experimentation, it appeared that despite taking up almost
6% of total CPU time, the function didn't do much. Cutting it provided
a non-negligible speed improvement:
real 1m19.463s
user 0m36.857s
sys 0m42.565s

real 1m19.837s
user 0m36.469s
sys 0m43.329s

real 1m19.521s
user 0m36.462s
sys 0m43.026s
A little further poking around revealed that run_check and
init_checker_state were the only consumers of the "argument names",
so I cut those out too. This led to some nice code simplification and
an ever-so-slight speedup:
real 1m19.686s
user 0m36.516s
sys 0m43.129s

real 1m18.466s
user 0m36.196s
sys 0m42.227s

real 1m19.063s
user 0m36.188s
sys 0m42.846s
Here is the yappi graph after these changes:

The major hotspot is now tokenize.tokenize, which is part of the
standard library. This is good, as it suggests that pycodestyle is
nearing the point of being as fast as it can be. After that, the next
most expensive functions are
- check_logical,
- generate_tokens,
- build_tokens_line,
- check_all,
- maybe_check_physical, and
- _is_eol_token.
These functions all feel to me like they are doing something inefficiently, but I don't understand them well enough to say what.
These measurements were all taken running master with
Python 3.6.5 (default, Mar 30 2018, 06:42:10)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin