OpenMP not useful with default RNG
To test Codon, this marvelous project, I decided to compute some digits of π using a basic Monte Carlo technique. Such workloads are by nature embarrassingly parallel. However, both on my Mac and on a beefy Linux desktop, I find that using @par only ever slows things down.
After some thought, I suspected the bottleneck was the random number generator at the heart of any Monte Carlo method, and that turned out to be the case.
I'm wondering now: should users be made aware of this bottleneck? Perhaps Codon could emit a warning when the default RNG is used inside an OpenMP loop.
Truth be told, I started out writing this issue expecting to find a bug, but then found I was the bug.
Here is my code for reference:
```python
import random
import statistics

K = 2             # Compute in 2 or 3 dimensions
M = 100           # Samples of π
N = 10_000_000    # Samples per π

Vbox = 2**K
if K == 2:
    Vsphere_over_pi = 1.0
elif K == 3:
    Vsphere_over_pi = 4.0/3.0
else:
    raise SystemExit(f'K = {K} not implemented')
pi_ratio = Vbox / Vsphere_over_pi

def sample_square_norm(rng) -> float:
    return sum(rng.uniform(-1, 1)**2 for k in range(K))

def test_inside(rng) -> int:
    return int(sample_square_norm(rng) <= 1.0)

def compute_ratio(rng) -> float:
    inside = 0
    for j in range(N):
        inside += test_inside(rng)
    return inside / N

pis = []
@par(schedule='static', ordered=False, num_threads=24)
for i in range(M):
    rng = random.Random()
    pis.append(pi_ratio * compute_ratio(rng))

mean = statistics.mean(pis)
std = statistics.stdev(pis)
print(f'{mean:.12g} ± {std:.5g}')
```
Another interesting note: if I collapse the `inside` summation in `compute_ratio` into a `sum()` over a generator, it is about 2x slower.
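For concreteness, here is a minimal sketch of the two variants being compared (the helper names `count_inside_loop` and `count_inside_sum` are hypothetical, not from the code above; this is the K = 2 case). With the same seed they draw the same uniforms in the same order, so they produce identical counts; the difference was purely in how Codon compiled the generator form:

```python
import random

def count_inside_loop(rng, n: int) -> int:
    # explicit accumulation, as in compute_ratio above
    inside = 0
    for _ in range(n):
        inside += int(rng.uniform(-1, 1)**2 + rng.uniform(-1, 1)**2 <= 1.0)
    return inside

def count_inside_sum(rng, n: int) -> int:
    # the sum()-over-a-generator form that was ~2x slower under Codon
    return sum(int(rng.uniform(-1, 1)**2 + rng.uniform(-1, 1)**2 <= 1.0)
               for _ in range(n))

n = 20_000
a = count_inside_loop(random.Random(0), n)
b = count_inside_sum(random.Random(0), n)
print(a == b, 4.0 * a / n)  # same count either way; the ratio estimates π
```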
Hi @lericson -- here's one way to do this:
```python
from time import timing
from threading import Lock, get_native_id

...

with timing('pi calc'):  # convenient way to time a block of code
    T = 12
    rng = [random.Random(i) for i in range(T)]
    lock = Lock()
    pis = []
    @par(schedule='static', ordered=False, num_threads=T)
    for i in range(M):
        x = pi_ratio * compute_ratio(rng[get_native_id()])
        with lock:  # should lock this as multiple threads are appending
            pis.append(x)
    mean = statistics.mean(pis)
    std = statistics.stdev(pis)
```
For me (Mac M1) it takes 9.4 seconds with T=1 and 1.3 seconds with T=12. The warning idea makes sense -- we actually want to do that for a few things related to @par, like missing locks and race conditions.
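One way to avoid the lock entirely is to preallocate one result slot per iteration, so each thread writes to a distinct index and no two threads ever touch the same element. The sketch below is an assumption-laden illustration of that pattern in plain CPython threading (not Codon's @par, and the GIL means it won't actually run in parallel here); the indexing idea is what carries over:

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor

M, N, T = 8, 100_000, 4   # smaller sizes than the issue, for a quick run
pis = [0.0] * M           # one slot per iteration; disjoint writes need no lock

def worker(i: int) -> None:
    rng = random.Random(i)  # per-iteration RNG, seeded for reproducibility
    inside = sum(int(rng.uniform(-1, 1)**2 + rng.uniform(-1, 1)**2 <= 1.0)
                 for _ in range(N))
    pis[i] = 4.0 * inside / N  # 2-D estimate: hit ratio times box volume 4

with ThreadPoolExecutor(max_workers=T) as ex:
    list(ex.map(worker, range(M)))

print(f'{statistics.mean(pis):.4f} ± {statistics.stdev(pis):.4f}')
```

The trade-off versus the append-under-lock version is that the output order is fixed by index rather than by completion order, which is usually what you want anyway.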
I just added a new pass that optimizes sum(), any() and all() on generators (https://github.com/exaloop/codon/pull/175/commits/810cb33dbba9f654f8b31489c50324ba7d5133c2). LLVM does a lot of these kinds of optimizations, but for some reason passing generators as arguments has always tripped it up. At least now the common uses of these builtins will be just as fast (on your code for example, the performance is identical now both with and without the sum()s). Going to close the issue for now.
That is very cool @arshajii ! Great project.