Making pycodestyle faster
I've been on a Python performance optimization kick recently (see
https://github.com/PyCQA/astroid/pull/497), and I'm a pycodestyle
user, so I figured I would give it a look and see if its performance
could be improved at all.
Of course, pycodestyle is already pretty fast, so to give it some
stress I'm testing it out by pointing it right at a large repo, namely
Zulip (https://github.com/zulip/zulip). In particular, my test command
is time ~/pycodestyle/pycodestyle.py -qq ~/zulip.
Here are the times from three runs of master:
real 2m10.664s
user 0m55.333s
sys 1m15.249s

real 2m11.376s
user 0m55.545s
sys 1m15.745s

real 2m11.588s
user 0m55.500s
sys 1m16.044s
I used the yappi profiling library to see if there were any hotspots
in the code. There were. Take a look at the graph below. The brighter
the box, the hotter the spot. In more detail, each box represents a
function and has three numbers: 1) the percentage of total CPU time
spent in that function, 2) the percentage of total CPU time spent in
that function but not its subcalls, and 3) the number of times the
function was called.

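For anyone who wants to reproduce this kind of graph, a minimal yappi
run looks roughly like the sketch below; my actual harness and the
graph-rendering step may differ in the details.

import yappi
import pycodestyle

yappi.set_clock_type("cpu")  # measure CPU time rather than wall time
yappi.start()
pycodestyle.StyleGuide(quiet=True).check_files(["/path/to/zulip"])
yappi.stop()

# Save callgrind output, which tools like gprof2dot or KCachegrind can
# turn into a call graph.
yappi.get_func_stats().save("pycodestyle.callgrind", type="callgrind")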
The red box that sticks out is Checker.run_check. It is called well
over two million times, and 27.7% of total CPU time is spent there,
almost all of it in the function itself rather than in its subcalls.
This seems like an awful lot considering how short the function is:
def run_check(self, check, argument_names):
    """Run a check plugin."""
    arguments = []
    for name in argument_names:
        arguments.append(getattr(self, name))
    return check(*arguments)
So why does it suck up so much time?
I think I've worked out how it goes. When a check is registered (with
register_check), its argument names are extracted with the inspect
library and stored as a list of strings. When a check is run,
run_check iterates over that list of names, dynamically looks each one
up as an attribute of the Checker, and passes the resulting values to
the check.
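In simplified form, the registration side looks roughly like this (a
sketch of the pattern, not pycodestyle's exact code; the registry and
example parameter names are just for illustration):

import inspect

_checks = []  # hypothetical registry of (check, argument_names) pairs

def register_check(check):
    # Read the check's parameter names with inspect and keep them as
    # strings, e.g. ['logical_line', 'indent_level', 'noqa'].
    argument_names = list(inspect.signature(check).parameters)
    _checks.append((check, argument_names))
    return check

At run time, run_check turns each stored name back into a value with
getattr(self, name): one dynamic lookup per argument, for every check,
on every line being examined.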
The problem here is that dynamic attribute access is slow, and doing it in tight loops is really slow (see https://github.com/PyCQA/astroid/pull/497 for a harrowing cautionary tale on this subject). My idea was to see if there was a way to do away with the dynamic attribute access, basically by "compiling" the attribute access into the code.
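To get a rough feel for the per-access cost, here is a standalone
micro-benchmark (not from pycodestyle; the class and attribute name
are arbitrary):

import timeit

setup = "class C: pass\nc = C(); c.logical_line = 'x = 1'"

# Dynamic lookup by string name, the way run_check does it.
print(timeit.timeit("getattr(c, 'logical_line')", setup=setup))

# Direct attribute access, with the name "compiled" into the code.
print(timeit.timeit("c.logical_line", setup=setup))

The difference per call is small, but run_check is doing this millions
of times, so it shows up in the profile.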
It turns out that this can be accomplished by passing the checker
instance into the check as an argument and accessing the attributes
directly on the checker. Implementing this change involves a
large-scale shuffling of arguments and strings, but beyond that not
much changes. register_check now has to take the check's argument
names explicitly, since they no longer appear in the check's actual
signature. run_check itself can be done away with, since all it would
do is call the check with the checker as its argument, and that can be
done inline.
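Continuing the earlier sketch, the reworked shape looks something like
this (again simplified and hypothetical, not the actual patch;
example_check and the checker attributes it reads are illustrative):

def register_check(check, argument_names):
    # The names are passed in explicitly now, because the check's real
    # signature is just (checker,) and no longer lists them.
    _checks.append((check, argument_names))
    return check

def example_check(checker):
    # A toy check: it reads what it needs directly off the checker,
    # with no per-name getattr.
    line = checker.physical_line
    if len(line.rstrip()) > checker.max_line_length:
        return checker.max_line_length, "E501 line too long"

With this shape, the body of run_check collapses to a plain call like
check(self), which can simply be inlined at its call sites.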
This change resulted in a substantial speedup:
real 1m28.057s
user 0m40.340s
sys 0m47.669s

real 1m27.843s
user 0m39.910s
sys 0m47.901s

real 1m28.258s
user 0m40.379s
sys 0m47.849s
Here is the resulting yappi graph:

This graph is a lot more colorful than the last one. This means that the work is spread out more evenly among the various functions and there isn't one overwhelmingly critical hotspot.
One function that stuck out to me was Checker.init_checker_state.
After some experimentation, it appeared that despite taking up almost
6% of total CPU time, the function didn't do much. Cutting it provided
a non-negligible speed improvement:
real 1m19.463s
user 0m36.857s
sys 0m42.565s

real 1m19.837s
user 0m36.469s
sys 0m43.329s

real 1m19.521s
user 0m36.462s
sys 0m43.026s
A little further poking around revealed that run_check and
init_checker_state were the only consumers of the "argument names",
so I cut those out too. This led to some nice code simplification and
an ever-so-slight speedup:
real 1m19.686s
user 0m36.516s
sys 0m43.129s

real 1m18.466s
user 0m36.196s
sys 0m42.227s

real 1m19.063s
user 0m36.188s
sys 0m42.846s
Here is the yappi graph after these changes:

The major hotspot is now tokenize.tokenize, which is part of the
standard library. This is good, as it suggests that pycodestyle is
nearing the point of being as fast as it can be. After that, the next
most expensive functions are
- check_logical,
- generate_tokens,
- build_tokens_line,
- check_all,
- maybe_check_physical, and
- _is_eol_token.
These functions all feel to me like they are doing something inefficiently, but I don't understand them well enough to say what.
These measurements were all taken running master with
Python 3.6.5 (default, Mar 30 2018, 06:42:10)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin