[4.1 Introduction]: why `add_python` is faster than `add_numpy` in the vectorized `add` example
I reached the opposite conclusion when running the example code in 4.1 Introduction. The following are my results, tested in IPython 6.4.0 with Python 3.6.5 and NumPy 1.14.3:
In [1]: import numpy as np
In [2]: import random
In [3]: def add_python(Z1,Z2):
   ...:     return [z1+z2 for (z1,z2) in zip(Z1,Z2)]
   ...:
   ...: def add_numpy(Z1,Z2):
   ...:     return np.add(Z1,Z2)
   ...:
In [4]: Z1 = random.sample(range(1000), 100)
In [5]: Z2 = random.sample(range(1000), 100)
# For Python lists `Z1`, `Z2`, `add_python` is faster
In [6]: %timeit add_python(Z1, Z2)
8.25 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: %timeit add_numpy(Z1, Z2)
16.9 µs ± 235 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: a = np.random.randint(0, 1000, size=100)
In [9]: b = np.random.randint(0, 1000, size=100)
# For Numpy array `a`, `b`, `add_numpy` is faster
In [10]: %timeit add_python(a, b)
22.6 µs ± 816 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [11]: %timeit add_numpy(a, b)
851 ns ± 6.37 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Interesting. I re-tested it using Python 3.7 and I got:
In [8]: %timeit add_python(Z1,Z2)
8.88 µs ± 423 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [9]: %timeit add_numpy(Z1,Z2)
14.4 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The same thing for me, using standard Python lists (Python 3.7, macOS Mojave):
%timeit add_python(Z1, Z2)
6 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit add_numpy(Z1, Z2)
11.1 µs ± 46.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Using np.arrays instead, the timings change in an interesting way:
%timeit add_python(Z3, Z4)
28.5 µs ± 996 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit add_numpy(Z3, Z4)
540 ns ± 21.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.add(Z3, Z4)
488 ns ± 8.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Interestingly, the Python call overhead really starts to show in micro-benchmarks like these.
So to summarize:
- numpy is about twice as slow for me with native python lists
- numpy is as fast as expected with numpy arrays, and python is about twice as slow with numpy arrays as with native lists
I'd say that is about as expected, so maybe that is what the example should compare, instead of running both compute paths on native python lists first?
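For reference, here is a sketch of my own (not from the book) that separates the two costs `np.add` pays when it is handed plain Python lists: converting each list to an ndarray, and then doing the add itself. The names `a` and `b` are just illustrative.

import numpy as np
import random

Z1 = random.sample(range(1000), 100)
Z2 = random.sample(range(1000), 100)
a, b = np.asarray(Z1), np.asarray(Z2)   # convert once, outside the timed call

# In IPython (timings are machine-dependent):
# %timeit np.asarray(Z1), np.asarray(Z2)   # list -> array conversion alone
# %timeit np.add(a, b)                     # the add on ready-made arrays
# np.add(Z1, Z2) has to pay both costs on every call, which is why it
# loses to the list comprehension at this size.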
I'd say the examples are just way too small for the differences to really show. When I scale the input up a bit, I get this:
length = 100000
import random
Z1, Z2 = random.sample(range(length), length), random.sample(range(length), length)
%timeit add_python(Z1, Z2)
%timeit [z1+z2 for (z1,z2) in zip(Z1,Z2)]
19.1 ms ± 514 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
15.6 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit add_numpy(Z1, Z2)
%timeit np.add(Z1, Z2)
11 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.9 ms ± 63.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Z3, Z4 = np.random.sample(length) * 100, np.random.sample(length) * 100
%timeit add_python(Z3, Z4)
%timeit [z3+z4 for (z3,z4) in zip(Z3,Z4)]
16.8 ms ± 93.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16.7 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit add_numpy(Z3, Z4)
%timeit np.add(Z3, Z4)
43.1 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
42.7 µs ± 278 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Nice. Could you make a PR for the book?
Sure, but it will probably take me until Christmas time.
(Also, my English is not great, so you will probably have to improve that. Sorry.)
Mine is about the same, so I'm not sure I can correct it :)
Hi @dwt, I'm getting similar results. Can you explain why this is about as expected (is it due to recent Python optimizations on arrays)?
My thinking is that you have to consider a numpy operation in three parts: switching from the Python to the C layer, doing the actual computation, and then switching back to Python.
Now the actual computation part is pretty much always faster than doing the same computation in Python. BUT if the context switches take more time than you save by doing the computation faster, then the pure Python solution can still be faster.
This is why larger lists / arrays / vectors make the switch to C more worthwhile: the savings in the computation can come to dominate the cost of crossing into the C layer.
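To make that concrete, here is a small sketch of my own (timings will vary by machine) that sweeps the input size and shows the crossover where the faster C computation starts to outweigh the fixed cost of crossing the Python/C boundary:

import timeit
import numpy as np

def add_python(Z1, Z2):
    return [z1 + z2 for (z1, z2) in zip(Z1, Z2)]

for n in (10, 100, 1_000, 10_000, 100_000):
    l1, l2 = list(range(n)), list(range(n))   # pure-Python inputs
    a, b = np.arange(n), np.arange(n)         # pre-built NumPy inputs
    t_py = timeit.timeit(lambda: add_python(l1, l2), number=1_000)
    t_np = timeit.timeit(lambda: np.add(a, b), number=1_000)
    print(f"n={n:>7}  python: {t_py:.4f}s  numpy: {t_np:.4f}s")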
Thank you for the explanation!
I've been playing around with this more today, and it seems that most of the time the python version is faster. My assumption is that addition is already fairly heavily optimized in python, leaving the time dominated by the numpy overhead.
vec_length = 1_000_000
Z1, Z2 = random.sample(range(vec_length), vec_length), random.sample(range(vec_length), vec_length)
# %timeit add_python(Z1, Z2)
# 253 ms ± 4.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# %timeit add_numpy(Z1, Z2)
# 501 ms ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
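One way to test that assumption (a sketch of my own, not part of the original example) is to time the list-to-array conversion by itself: when `np.add` receives Python lists it has to build both arrays on every call, and at this size that conversion is most of the cost.

import numpy as np
import random

vec_length = 1_000_000
Z1 = random.sample(range(vec_length), vec_length)
Z2 = random.sample(range(vec_length), vec_length)
Z1_np, Z2_np = np.array(Z1), np.array(Z2)

# In IPython (timings are machine-dependent):
# %timeit np.array(Z1), np.array(Z2)   # conversion cost alone
# %timeit np.add(Z1_np, Z2_np)         # the add on pre-built arrays
# If the first timing is close to add_numpy(Z1, Z2) above, the
# "overhead dominates" explanation holds.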
I got similar results at different sizes. It might be worth swapping this example out for something more computationally involved to make the point:
def add_python(Z1, Z2):
    return [((z1**2 + z2**2)**0.5) + ((z1 + z2)**3) for z1, z2 in zip(Z1, Z2)]

def add_numpy(Z1, Z2):
    return np.sqrt(Z1**2 + Z2**2) + (Z1 + Z2)**3
vec_length = 1_000_000
Z1, Z2 = random.sample(range(vec_length), vec_length), random.sample(range(vec_length), vec_length)
Z1_np, Z2_np = np.array(Z1, dtype=np.float64), np.array(Z2, dtype=np.float64)
%timeit add_python(Z1, Z2)
# 665 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit add_numpy(Z1_np, Z2_np)
# 54.2 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
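As a quick sanity check (my own addition) that the two versions still compute the same thing before comparing their timings, something like this can be run on a small input:

# small illustrative inputs; uses the add_python / add_numpy defined above
check1 = np.array([1.0, 2.0, 3.0])
check2 = np.array([4.0, 5.0, 6.0])
assert np.allclose(add_python(check1, check2), add_numpy(check1, check2))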
I tried again with the simple add version and 1,000,000 elements, and I get:
%timeit add_python(Z1, Z2)
54.6 ms ± 331 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit add_numpy(Z1_np, Z2_np)
645 µs ± 3.91 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Interesting -- my example is running on Python 3.11, Windows 10, and NumPy 1.24.3. Your differences are not only much more pronounced, the runs are also much faster overall.
macOS, MacBook M1, Python 3.11, NumPy 1.26.0