
Weights are being normalized using number of samples as opposed to sum in GaussianMixture

Open kshitijgoel007 opened this issue 3 years ago • 3 comments

Describe the bug

The weights are normalized using `n_samples` at https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/mixture/_gaussian_mixture.py#L718. They should instead be normalized using `weights.sum()`, as is done in `_m_step()` here: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/mixture/_gaussian_mixture.py#L756.

Steps/Code to Reproduce

There is no runnable snippet; the discrepancy is visible by inspecting the two lines referenced above: line 718 divides by `n_samples`, while `_m_step()` at line 756 divides by `weights.sum()`.
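For illustration, here is a minimal sketch (my own, not from the codebase) of why the two normalizations differ. The variable names are mine; the point is that dividing `nk` by `n_samples` and dividing by `nk.sum()` only agree when the responsibility rows sum to 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_components = 6, 3

# A responsibility-like matrix whose rows do NOT sum to 1
# (e.g. an unnormalized user-provided initial state).
resp = rng.random((n_samples, n_components))
nk = resp.sum(axis=0)

weights_by_n_samples = nk / n_samples  # normalization questioned in this issue
weights_by_sum = nk / nk.sum()         # normalization used in _m_step()

print(weights_by_sum.sum())        # 1.0 within floating point tolerance
print(weights_by_n_samples.sum())  # generally not 1.0 for unnormalized resp
```

For a properly normalized responsibility matrix the two coincide, since `nk.sum()` equals `n_samples` exactly in that case; the bug only shows up when the rows do not sum to 1.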

Expected Results

Weights normalized by `weights.sum()`, so that they sum to 1 consistently with `_m_step()`.

Actual Results

Weights normalized by `n_samples`, which differs from `weights.sum()` whenever the responsibilities are not normalized.

Versions

System:
    python: 3.9.13 (main, May 24 2022, 21:28:31)  [Clang 13.1.6 (clang-1316.0.21.2)]
executable: /Users/kshitijgoel/Documents/main/code.nosync/self_organizing_gmm/.venv/bin/python
   machine: macOS-12.4-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.1.1
          pip: 22.2.1
   setuptools: 62.3.2
        numpy: 1.23.1
        scipy: 1.8.1
       Cython: 0.29.30
       pandas: 1.4.3
   matplotlib: 3.5.2
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: False

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/kshitijgoel/Documents/main/code.nosync/self_organizing_gmm/.venv/lib/python3.9/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.20
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/kshitijgoel/Documents/main/code.nosync/self_organizing_gmm/.venv/lib/python3.9/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.17
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8

— kshitijgoel007, Aug 02 '22 16:08

I can work on this if it is still available.

— kasmith11, Aug 04 '22 22:08

Seems like a simple fix. Note that a similar change was made in PR #23034, addressing an error reported in #23032.

— Micky774, Aug 05 '22 00:08

Hi @kshitijgoel007, thank you for reporting this.

Do you by chance have a minimal reproducible example where this incorrect weighting causes a problem?

— jjerphan, Aug 08 '22 15:08

I can work on this, if it has not been finished yet.

— jsawalha, Oct 31 '22 19:10

Please see my review on the PR but I thought I would write my comments coherently here.

  1. I would not re-normalize user-provided `weights_init`; otherwise we take away the ability to precisely define an initial state for the EM algorithm. It is up to the user to provide a properly defined and normalized `weights_init`.
  2. All the current initialization options boil down to passing a normalized responsibilities matrix through `_estimate_gaussian_parameters` (I tested this, see below), so it is worth making the normalization here as generic as possible.
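As a side note on point 1: if I read the current code right, scikit-learn already rejects an unnormalized `weights_init` at fit time, so silently re-normalizing on the user's behalf would also be inconsistent with that validation. A hedged sketch (the exact error message may differ between versions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(100, 2))

# weights_init sums to 1.8, not 1; fit should refuse it
gm = GaussianMixture(n_components=2, weights_init=[0.9, 0.9], random_state=0)
try:
    gm.fit(X)
except ValueError as err:
    print(err)  # complains that the weights are not normalized
```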

To that end, it would be good to know the range of output weights currently possible from a normalized responsibilities matrix. I think the simple sum is the best way to go, since it avoids floating point errors when the weights should sum to exactly 1. (We could also consider allowing a normalization callable, defaulting to `sum()`; the same would apply to the normalization at each M-step, but there are no discussions or feature requests about that.)

I have a PR #24812 for some requested features: passing in user-provided responsibilities, or a callable initializer which creates a responsibilities matrix. I would be happy if the normalization changes here keep that use case in mind, but since both additions also just boil down to passing a normalized responsibilities matrix as `resp`, it should not be a problem.

What can help, though, are the responsibilities helper and validation functions that are part of #24812. I debugged the various initial states produced by the initialization options and looked at the responsibility matrices they create. (Surprisingly to me, some are continuous in [0, 1] while others are binary 0/1, but the range is always [0, 1], and each row, i.e. the responsibilities of one sample across all components, always sums to 1, since each sample's responsibilities form a probability distribution over components.) See below. (The PR also adds basic tests. Since some here are familiar with the Mixture codebase, it would be great to have a second pair of eyes.)

Helper to calculate normalized responsibilities:

  import numpy as np

  from sklearn.mixture._base import _check_shape  # private helper; in the PR this lives in the same module


  def get_responsibilities(n_samples, n_components, indices=None, labels=None):
      """Create correctly shaped responsibilities from an array of `indices` or `labels`.
  
      Responsibilities are 1 for the component of the corresponding index or label,
      and 0 otherwise. Note that `n_components` is not the number of features
      in the data, but the number of components in the mixture model. Responsibilities
      can be used to calculate initial weights, means, and precisions of the mixture
      model components. Either `indices` or `labels` must be provided.
  
      Parameters
      ----------
      n_samples : int
          Number of samples.
  
      n_components : int
          Number of components.
  
      indices : array-like of shape (n_components,), default=None
          The index locations of the chosen components (centers) in the data array X
          of shape (n_samples, n_features); these will be set to 1 in the output.
          Either `indices` or `labels` must be provided.
  
      labels : array-like of shape (n_samples,), default=None
          Used over `indices` if not `None`. The labels i=0 to n_components-1
          will be set to 1 for each sample in the output.
          Either `indices` or `labels` must be provided.
  
      Returns
      -------
      responsibilities : array, shape (n_samples, n_components)
          Responsibilities of each sample in each component.
      """
      resp = np.zeros((n_samples, n_components))
      if labels is not None:
          _check_shape(labels, (n_samples,), "labels")  # will raise if incompatible
          resp[np.arange(n_samples), labels] = 1
      elif indices is not None:
          resp[indices, np.arange(n_components)] = 1
      else:
          raise ValueError(
              "Either `indices` or `labels` must be provided, both were `None`."
          )
      return resp
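A quick self-contained check of the `labels` branch above (plain NumPy, mirroring the helper's one-hot logic rather than importing it): every row of the resulting matrix sums to exactly 1, which is the normalization property discussed in this issue.

```python
import numpy as np

n_samples, n_components = 5, 3
labels = np.array([0, 2, 1, 2, 0])

# One-hot responsibilities, as in the `labels` branch of the helper:
# row i has a single 1 in column labels[i].
resp = np.zeros((n_samples, n_components))
resp[np.arange(n_samples), labels] = 1

print(resp.sum(axis=1))  # [1. 1. 1. 1. 1.]
```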

Validator for normalized responsibilities:

  import numpy as np

  from sklearn.mixture._base import _check_shape  # private helper; in the PR this lives in the same module
  from sklearn.utils import check_array


  def _check_responsibilities(resp, n_components, n_samples):
      """Check the user-provided `resp`.
  
      Parameters
      ----------
      resp : array-like of shape (n_samples, n_components)
          The responsibilities for each data sample in terms of each component.
  
      n_components : int
          Number of components.
  
      n_samples : int
          Number of samples.
  
      Returns
      -------
      resp : array, shape (n_samples, n_components)
      """
      resp = check_array(resp, dtype=[np.float64, np.float32], ensure_2d=False)
      _check_shape(resp, (n_samples, n_components), "responsibilities")
  
      # check that the row sums lie in [0, 1] within floating point tolerance
      axis_sum = resp.sum(axis=1)
      less_1 = np.allclose(axis_sum[axis_sum > 1], 1)
      positive = np.allclose(axis_sum[axis_sum < 0] + 1, 1)
      in_domain = positive and less_1
      if not in_domain:
          raise ValueError(
              "The parameter 'responsibilities' should be normalized in "
              "the range [0, 1] within floating point tolerance, but got: "
              f"max value {np.max(resp)}, min value {np.min(resp)}"
          )
  
      # check that a proper normalization exists
      nrows_1 = np.sum(np.isclose(axis_sum[axis_sum >= 1], 1))
      if nrows_1 < n_components:
          raise ValueError(
              "The parameter 'responsibilities' should be normalized, "
              f"with at least `n_components`={n_components} rows summing to 1, "
              f"but got {nrows_1}."
          )
      return resp

— emirkmo, Nov 14 '22 10:11

@emirkmo I kind of agree with the principle of not renormalizing user-defined weights.

What I am wondering, however, is whether it makes sense to have unnormalized weights at init that will get normalized in the next M-step anyway. Intuitively, I would expect the same normalization (or lack thereof) throughout the full iterative algorithm.

— glemaitre, Dec 29 '22 10:12

@glemaitre The init step really is a way to fine-tune the path the algorithm will take. For some use cases it barely matters at all, but for many others it can be really important, as the EM algorithm is highly sensitive to the initial state. (This can be especially important for sparse multi-dimensional data.)

So ultimately I don't think the rest of the algorithm has much to do with the init state. In fact, the initial state does not even have to be super well defined, as the M-steps seem capable of handling that in my testing.

What is most important (without creating bugs like overflow that can lead to negative normalization in the M-steps) is allowing the init state to be specified precisely, and providing tools for users to do this easily and intelligently. (I'll shamelessly link to #24812.)

Otherwise, it is like trying to stack magnetic blocks: as you go to place one just so, the system shifts by itself and the end result isn't what you want. Very frustrating.
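To make the sensitivity concrete, here is a small sketch (my own toy example, not from the PR): the same two-cluster data fit from two different `means_init` values. EM follows a different path depending on the starting point, which is why being able to pin down the init state precisely matters.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# two well-separated 1-D clusters
X = np.vstack([
    rng.normal(-3.0, 0.5, size=(200, 1)),
    rng.normal(3.0, 0.5, size=(200, 1)),
])

good_init = np.array([[-3.0], [3.0]])  # near the true component means
poor_init = np.array([[0.1], [0.2]])   # both components start in the gap

gm_good = GaussianMixture(n_components=2, means_init=good_init, random_state=0).fit(X)
gm_poor = GaussianMixture(n_components=2, means_init=poor_init, random_state=0).fit(X)

# Compare the number of EM iterations each run took; on harder data a
# poor init can also land in a worse local optimum.
print(gm_good.n_iter_, gm_poor.n_iter_)
```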

emirkmo avatar Jan 11 '23 17:01 emirkmo