BPR results not reproducible using multiple threads

Open Bernhard-Steindl opened this issue 1 year ago • 0 comments

Hello,

Thank you for sharing this great library with us! 😊 I am using implicit version 0.7.2 and I think I found a bug.

If I create two instances of BPR with the same arguments and a fixed random_state, and fit them with more than one thread (num_threads > 1), I get different user and item factors. With a single thread the issue does not occur. It seems that BPR produces non-deterministic results when using multiple threads.

I suspect the bug is related to how the RNGVector class (a C++ RNG object) is used in bpr.pyx. https://github.com/benfred/implicit/blob/v0.7.2/implicit/cpu/bpr.pyx#L43 A NumPy random generator is seeded with the supplied random_state argument (check_random_state in utils.py). https://github.com/benfred/implicit/blob/v0.7.2/implicit/utils.py#L83 Each thread is then supposed to get its own RNG inside RNGVector, seeded with a different value drawn from that generator. https://github.com/benfred/implicit/blob/v0.7.2/implicit/cpu/bpr.pyx#L185-L187
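
To make the seeding scheme described above concrete, here is a minimal sketch. It uses Python's standard random module as a stand-in for the NumPy generator, and per_thread_seeds is a hypothetical helper, not a function from implicit. It only shows that drawing one seed per thread from a fixed random_state is deterministic in itself, so the seeds alone should not explain the differing factors.

```python
import random


def per_thread_seeds(random_state, num_threads):
    """Draw one 31-bit seed per thread from a generator seeded with random_state.

    Hypothetical helper for illustration; not part of implicit.
    """
    master = random.Random(random_state)
    return [master.randrange(2**31) for _ in range(num_threads)]


# Two runs with the same random_state yield the same per-thread seeds.
seeds_a = per_thread_seeds(42, 2)
seeds_b = per_thread_seeds(42, 2)
print(seeds_a == seeds_b)
```

If the per-thread seeds are reproducible like this, any run-to-run difference would have to come from how the threads use their RNGs or update the factors, not from the seeding step.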

However, this issue does not occur for me with the LMF matrix factorization model, although it also uses a C++ RNG object. https://github.com/benfred/implicit/blob/v0.7.2/implicit/cpu/lmf.pyx#L40

BPR and LMF differ in how they obtain the thread id that is passed to RNGVector.

When running in parallel, lmf_update obtains the thread id with the threadid function from cython.parallel and uses it to generate a random number from RNGVector via rng.generate(thread_id). https://github.com/benfred/implicit/blob/v0.7.2/implicit/cpu/lmf.pyx#L250
In contrast, bpr_update obtains the thread id from an external function get_thread_num in bpr.h, which returns omp_get_thread_num() if the _OPENMP macro is defined and 0 otherwise. https://github.com/benfred/implicit/blob/v0.7.2/implicit/cpu/bpr.pyx#L267 https://github.com/benfred/implicit/blob/v0.7.2/implicit/cpu/bpr.h#L10-L20

Maybe this is the reason for the non-deterministic behaviour of BPR?
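
As a rough illustration of why the thread id could matter, the sketch below replays two interleavings of two threads drawing random numbers. It is pure Python and does not use implicit; simulate, own_slot and always_zero are hypothetical names. It assumes the case where get_thread_num would return 0 for every thread (for example if _OPENMP were not defined in the build). Whether that actually happens in the conda-forge package is not verified here.

```python
import random


def simulate(schedule, seeds, slot_for_thread):
    """Replay `schedule`, the order in which threads request a random draw.

    slot_for_thread maps a thread number to the RNG slot it reads from.
    Returns a dict: thread number -> list of draws that thread received.
    """
    rngs = [random.Random(seed) for seed in seeds]
    draws = {thread: [] for thread in set(schedule)}
    for thread in schedule:
        slot = slot_for_thread(thread)
        draws[thread].append(rngs[slot].random())
    return draws


def own_slot(thread):
    # Each thread reads from its own RNG (thread id reported correctly).
    return thread


def always_zero(thread):
    # Every thread reads from RNG 0 (thread id always reported as 0).
    return 0


seeds = [11, 22]
schedule_a = [0, 1, 0, 1]
schedule_b = [1, 0, 0, 1]  # same amount of work, different interleaving

# Own slot per thread: each thread sees the same draws in both interleavings.
print(simulate(schedule_a, seeds, own_slot) == simulate(schedule_b, seeds, own_slot))

# Shared slot 0: the draws a thread sees depend on the interleaving.
print(simulate(schedule_a, seeds, always_zero) == simulate(schedule_b, seeds, always_zero))
```

With one RNG slot per thread, the draws per thread are independent of scheduling. With a shared slot they are not, even in this race-free simulation, which would be one way to get different factors from identical seeds.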

Best regards, Bernhard

Minimal example that creates two BPR instances with a fixed seed and multiple threads. The user and item factors differ after fitting, so the stricter assertions fail.

import implicit
from threadpoolctl import threadpool_limits
from scipy.sparse import csr_matrix
import numpy as np

print('implicit version', implicit.__version__, '\n')

raw = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1],
]

def get_bpr_model():
    return implicit.bpr.BayesianPersonalizedRanking(factors=3,
                                       learning_rate=1e-3,
                                       regularization=1e-3,
                                       use_gpu=False,
                                       verify_negative_samples=True,
                                       iterations=5,
                                       num_threads=2,          # Assertions fail if num_threads > 1
                                       random_state=42)

with threadpool_limits(limits=1):

    model_A = get_bpr_model()
    model_A.fit(csr_matrix(raw), show_progress=False)
    print(model_A.item_factors)
    print(model_A.user_factors)
    print(model_A.user_factors.dot(model_A.item_factors.T))

    model_B = get_bpr_model()
    model_B.fit(csr_matrix(raw), show_progress=False)
    print(model_B.item_factors)
    print(model_B.user_factors)
    print(model_B.user_factors.dot(model_B.item_factors.T))


assert np.allclose(model_A.user_factors, model_B.user_factors, atol=1e-2)
assert np.allclose(model_A.item_factors, model_B.item_factors, atol=1e-2)

assert np.allclose(model_A.user_factors, model_B.user_factors, atol=1e-3)
assert np.allclose(model_A.item_factors, model_B.item_factors, atol=1e-3)

assert np.allclose(model_A.user_factors, model_B.user_factors, atol=1e-4)
assert np.allclose(model_A.item_factors, model_B.item_factors, atol=1e-4)

assert np.allclose(model_A.user_factors, model_B.user_factors, atol=1e-5)
assert np.allclose(model_A.item_factors, model_B.item_factors, atol=1e-5)

Output:

implicit version 0.7.2 

[[-0.13629021  0.09140649  0.05145105 -0.01928077]
 [-0.02253029  0.11990372 -0.13830613  0.06782728]
 [-0.10003278 -0.13561453  0.00902637  0.15518866]
 [ 0.07921306  0.08666488  0.07246038  0.09480929]
 [ 0.00444893 -0.1239751   0.11326855 -0.01699392]
 [-0.00050317 -0.04274316 -0.10560265  0.14299433]]
[[ 0.09415838  0.0487572  -0.03222072  1.        ]
 [ 0.01535796 -0.01886933 -0.01671766  1.        ]
 [-0.13623102  0.01816958  0.12942049  1.        ]
 [ 0.11905595  0.10971054 -0.07423817  1.        ]
 [-0.1114961   0.08572128  0.06691711  1.        ]
 [-0.14401852  0.15679215 -0.01908839  1.        ]
 [ 0.05942317  0.09276801  0.08664715  1.        ]]
[[-0.0293147   0.07600836  0.13886672  0.10415867 -0.02626929  0.14426552]
 [-0.02395883  0.06753092  0.15606043  0.09317917 -0.01647985  0.14555857]
 [ 0.00760582  0.05517557  0.16752037  0.09497053 -0.00519331  0.1286191 ]
 [-0.02929831  0.08856721  0.12773071  0.1083688  -0.03847447  0.14608479]
 [ 0.00719349  0.07136258  0.1553209   0.0982552  -0.02053766  0.13231981]
 [ 0.01369725  0.09251206  0.14815965  0.09560636 -0.03923509  0.13838078]
 [-0.01444188  0.06562787  0.13744582  0.11383459 -0.01841608  0.12984906]]
[[-0.13642174  0.09134811  0.05149103 -0.01928074]
 [-0.02247692  0.11985338 -0.1383458   0.06830399]
 [-0.10003294 -0.1356145   0.00902635  0.15518874]
 [ 0.07921314  0.08666489  0.07246038  0.09480958]
 [ 0.00444896 -0.12397511  0.11326852 -0.01699418]
 [-0.00050322 -0.04274315 -0.10560266  0.14299443]]
[[ 0.0941582   0.04875714 -0.03222046  1.        ]
 [ 0.01535813 -0.01886933 -0.01671774  1.        ]
 [-0.1362314   0.01816951  0.12942068  1.        ]
 [ 0.11917121  0.10970803 -0.07422689  1.        ]
 [-0.11149608  0.08572138  0.06691711  1.        ]
 [-0.14401817  0.15679215 -0.01908856  1.        ]
 [ 0.05942311  0.092768    0.08664716  1.        ]]
[[-0.02933116  0.07648887  0.1388668   0.10415898 -0.02626951  0.14426559]
 [-0.02396042  0.06801005  0.15606049  0.09317947 -0.01648012  0.14555869]
 [ 0.00762793  0.05563892  0.16752052  0.0949708  -0.00519355  0.1286192 ]
 [-0.02933869  0.08904324  0.12771969  0.10837884 -0.03847263  0.14608376]
 [ 0.00720586  0.07182638  0.15532097  0.0982555  -0.02053794  0.13231991]
 [ 0.01370624  0.09297396  0.14815971  0.09560666 -0.03923537  0.13838091]
 [-0.01445162  0.06609963  0.13744588  0.11383489 -0.01841634  0.12984917]]
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[15], line 50
     47 assert np.allclose(model_A.user_factors, model_B.user_factors, atol=1e-3)
     48 assert np.allclose(model_A.item_factors, model_B.item_factors, atol=1e-3)
---> 50 assert np.allclose(model_A.user_factors, model_B.user_factors, atol=1e-4)
     51 assert np.allclose(model_A.item_factors, model_B.item_factors, atol=1e-4)
     53 assert np.allclose(model_A.user_factors, model_B.user_factors, atol=1e-5)

AssertionError:
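
To quantify the discrepancy instead of stepping through tolerances, the largest element-wise difference can be computed directly. The sketch below is plain Python with a hypothetical helper max_abs_diff, applied to the fourth row of user_factors copied from the output above (the row that differs most visibly).

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equally shaped 2-D lists."""
    return max(
        abs(x - y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


# Fourth row of user_factors as printed above for model_A and model_B.
row_a = [[0.11905595, 0.10971054, -0.07423817, 1.0]]
row_b = [[0.11917121, 0.10970803, -0.07422689, 1.0]]

diff = max_abs_diff(row_a, row_b)
print(f"{diff:.2e}")
```

For this row the difference is roughly 1.15e-4, which is consistent with the user_factors assertion passing at atol=1e-3 and failing at atol=1e-4.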

I am on Windows (win64) and use this conda environment:

name: my-conda-env
channels:
  - conda-forge
  - nodefaults
dependencies:
  - implicit=0.7.2
  - libblas=3.9.0=21_win64_mkl
  - mkl=2024.0.0=h66d3029_49657
  - numpy=1.24.4
  - pandas=2.0.3
  - pip=24.0
  - python=3.8.18
  - scipy=1.10.1
  - pip:
      - ipykernel==6.29.2

Mar 07 '24 09:03 Bernhard-Steindl