[4.1 Introduction]: why `add_python` is faster than `add_numpy` in the vectorized `add` example
I reached the opposite conclusion when running the example code in 4.1 Introduction. The following are my results, tested in IPython 6.4.0 with Python 3.6.5 and NumPy 1.14.3:
In [1]: import numpy as np
In [2]: import random
In [3]: def add_python(Z1,Z2):
   ...:     return [z1+z2 for (z1,z2) in zip(Z1,Z2)]
   ...:
   ...: def add_numpy(Z1,Z2):
   ...:     return np.add(Z1,Z2)
   ...:
In [4]: Z1 = random.sample(range(1000), 100)
In [5]: Z2 = random.sample(range(1000), 100)
# For Python lists `Z1`, `Z2`, `add_python` is faster
In [6]: %timeit add_python(Z1, Z2)
8.25 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: %timeit add_numpy(Z1, Z2)
16.9 µs ± 235 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: a = np.random.randint(0, 1000, size=100)
In [9]: b = np.random.randint(0, 1000, size=100)
# For Numpy array `a`, `b`, `add_numpy` is faster
In [10]: %timeit add_python(a, b)
22.6 µs ± 816 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [11]: %timeit add_numpy(a, b)
851 ns ± 6.37 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Interesting. I re-tested it using Python 3.7 and I got:
In [8]: %timeit add_python(Z1,Z2)
8.88 µs ± 423 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [9]: %timeit add_numpy(Z1,Z2)
14.4 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The same thing for me, using standard Python lists (Python 3.7, macOS Mojave):
%timeit add_python(Z1, Z2)
6 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit add_numpy(Z1, Z2)
11.1 µs ± 46.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Using np.arrays instead, the timings change in an interesting way:
%timeit add_python(Z3, Z4)
28.5 µs ± 996 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit add_numpy(Z3, Z4)
540 ns ± 21.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.add(Z3, Z4)
488 ns ± 8.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Interestingly, the Python call overhead really starts to show in micro-benchmarks like these.
So to summarize:
- numpy is about twice as slow for me with native python lists
- numpy is as fast as expected with numpy arrays, and python is about twice as slow with numpy arrays as with native lists
I'd say that is about as expected, so maybe that is what the example should compare, instead of running both compute paths on native python lists first?
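For reference, here is a sketch of my own (not from the book) that separates the two costs `np.add` pays when it is handed plain Python lists: converting each list to an ndarray, and then doing the add itself. The names `a` and `b` are just illustrative.

import numpy as np
import random

Z1 = random.sample(range(1000), 100)
Z2 = random.sample(range(1000), 100)
a, b = np.asarray(Z1), np.asarray(Z2)   # convert once, outside the timed call

# In IPython (timings are machine-dependent):
# %timeit np.asarray(Z1), np.asarray(Z2)   # list -> array conversion alone
# %timeit np.add(a, b)                     # the add on ready-made arrays
# np.add(Z1, Z2) has to pay both costs on every call, which is why it
# loses to the list comprehension at this size.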
I'd say the examples are just way too small for the differences to really show. When I scale the input up a bit, I get this:
length = 100000
import random
Z1, Z2 = random.sample(range(length), length), random.sample(range(length), length)
%timeit add_python(Z1, Z2)
%timeit [z1+z2 for (z1,z2) in zip(Z1,Z2)]
19.1 ms ± 514 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
15.6 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit add_numpy(Z1, Z2)
%timeit np.add(Z1, Z2)
11 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.9 ms ± 63.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Z3, Z4 = np.random.sample(length) * 100, np.random.sample(length) * 100
%timeit add_python(Z3, Z4)
%timeit [z3+z4 for (z3,z4) in zip(Z3,Z4)]
16.8 ms ± 93.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16.7 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit add_numpy(Z3, Z4)
%timeit np.add(Z3, Z4)
43.1 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
42.7 µs ± 278 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Nice. Could you make a PR for the book?
Sure, but it will probably take me until Christmas time.
(Also, my English is not great, so you will probably have to improve that. Sorry.)
Mine is about the same, so I'm not sure I can correct it :)
Hi @dwt, I'm getting similar results. Can you explain why this is about as expected (is it due to recent Python optimizations on arrays)?
My thinking is that you have to consider a numpy operation in three parts: switching from the Python to the C layer, doing the actual computation, and then switching back to Python.
Now the actual computation part is pretty much always faster than doing the same computation in Python. BUT if the context switches take more time than you save by doing the computation faster, then the pure Python solution can still be faster.
This is why larger lists / arrays / vectors make the switch to C more worthwhile: the savings in the computation can come to dominate the cost of crossing into the C layer.
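To make that concrete, here is a small sketch of my own (timings will vary by machine) that sweeps the input size and shows the crossover where the faster C computation starts to outweigh the fixed cost of crossing the Python/C boundary:

import timeit
import numpy as np

def add_python(Z1, Z2):
    return [z1 + z2 for (z1, z2) in zip(Z1, Z2)]

for n in (10, 100, 1_000, 10_000, 100_000):
    l1, l2 = list(range(n)), list(range(n))   # pure-Python inputs
    a, b = np.arange(n), np.arange(n)         # pre-built NumPy inputs
    t_py = timeit.timeit(lambda: add_python(l1, l2), number=1_000)
    t_np = timeit.timeit(lambda: np.add(a, b), number=1_000)
    print(f"n={n:>7}  python: {t_py:.4f}s  numpy: {t_np:.4f}s")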
Thank you for the explanation!
I've been playing around with this more today, and it seems that most of the time the python version is faster. My assumption is that addition is already fairly heavily optimized in python, leaving the time dominated by the numpy overhead.
vec_length = 1_000_000
Z1, Z2 = random.sample(range(vec_length), vec_length), random.sample(range(vec_length), vec_length)
# %timeit add_python(Z1, Z2)
# 253 ms ± 4.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# %timeit add_numpy(Z1, Z2)
# 501 ms ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
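One way to test that assumption (a sketch of my own, not part of the original example) is to time the list-to-array conversion by itself: when `np.add` receives Python lists it has to build both arrays on every call, and at this size that conversion is most of the cost.

import numpy as np
import random

vec_length = 1_000_000
Z1 = random.sample(range(vec_length), vec_length)
Z2 = random.sample(range(vec_length), vec_length)
Z1_np, Z2_np = np.array(Z1), np.array(Z2)

# In IPython (timings are machine-dependent):
# %timeit np.array(Z1), np.array(Z2)   # conversion cost alone
# %timeit np.add(Z1_np, Z2_np)         # the add on pre-built arrays
# If the first timing is close to add_numpy(Z1, Z2) above, the
# "overhead dominates" explanation holds.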
I got similar results at different sizes. It might be worth swapping this example out for something more computationally involved to make the point:
def add_python(Z1, Z2):
    return [((z1**2 + z2**2)**0.5) + ((z1 + z2)**3) for z1, z2 in zip(Z1, Z2)]

def add_numpy(Z1, Z2):
    return np.sqrt(Z1**2 + Z2**2) + (Z1 + Z2)**3
vec_length = 1_000_000
Z1, Z2 = random.sample(range(vec_length), vec_length), random.sample(range(vec_length), vec_length)
Z1_np, Z2_np = np.array(Z1, dtype=np.float64), np.array(Z2, dtype=np.float64)
%timeit add_python(Z1, Z2)
# 665 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit add_numpy(Z1_np, Z2_np)
# 54.2 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
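As a quick sanity check (my own addition) that the two versions still compute the same thing before comparing their timings, something like this can be run on a small input:

# small illustrative inputs; uses the add_python / add_numpy defined above
check1 = np.array([1.0, 2.0, 3.0])
check2 = np.array([4.0, 5.0, 6.0])
assert np.allclose(add_python(check1, check2), add_numpy(check1, check2))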
I tried again with the simple add version and 1,000,000 elements, and I get:
%timeit add_python(Z1, Z2)
54.6 ms ± 331 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit add_numpy(Z1_np, Z2_np)
645 µs ± 3.91 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Interesting -- my example is running on Python 3.11, Windows 10, and NumPy 1.24.3. Your differences are not only much more pronounced, the runs are also much faster overall.
macOS, MacBook M1, Python 3.11, NumPy 1.26.0