Bugfix: Gene score edge case where gene_list gene is chosen as control gene
In some edge cases, the control gene selection retrieves the same gene(s) that are also in the `gene_list` used for scoring. When the following line is then called, we end up with an empty control gene set, causing the downstream error in #2153: https://github.com/scverse/scanpy/blob/383a61b2db0c45ba622f231f01d0e7546d99566b/scanpy/tools/_score_genes.py#L173
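For context, here is a simplified, hypothetical sketch of the failure mode (not the actual scanpy implementation, just the set logic that produces an empty control set):

```python
# Simplified sketch: if every sampled control gene happens to also be in
# gene_list, subtracting gene_list afterwards leaves an empty control set.
def pick_controls(sampled: set[str], gene_list: set[str]) -> set[str]:
    # The problematic step: remove the genes being scored from the controls.
    return sampled - gene_list

# Worst-case draw: the sampler returned only gene_list members.
controls = pick_controls({"CD3D", "CD3E"}, {"CD3D", "CD3E"})
print(len(controls))  # 0 -> averaging over an empty control set fails downstream
```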
- [x] Closes #2153
- [x] Tests included
- [ ] Release notes not necessary because:
Codecov Report
Attention: Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.
Project coverage is 76.50%. Comparing base (4b090c0) to head (97e4024).
Additional details and impacted files
    @@           Coverage Diff           @@
    ##             main    #2875   +/-  ##
    =======================================
      Coverage   76.50%   76.50%
    =======================================
      Files         109      109
      Lines       12474    12485    +11
    =======================================
    + Hits         9543     9552     +9
    - Misses       2931     2933     +2
| Files | Coverage Δ | |
|---|---|---|
| src/scanpy/preprocessing/_simple.py | 87.95% <ø> (ø) | |
| src/scanpy/tools/_score_genes.py | 85.10% <86.66%> (-0.44%) | :arrow_down: |
Tests are failing, I believe, because the `gene_list` genes are now removed from the control genes before random sampling rather than after, resulting in a different control gene set. I'm not quite sure why the difference in scores is so large, though.
@flying-sheep, I have never been too familiar with the score_genes code. Could you take a look at this?
There are two differences:

- `np.unique(...)` returns the values sorted, while `pd.Series(...).unique()` returns them in their original order (this alone already makes the scores not match). This probably changes the sampling, but I wonder why the score difference is so large here! With only that change, I get:

      Arrays are not equal
      Mismatched elements: 2730 / 2730 (100%)
      Max absolute difference: 0.22674037
      Max relative difference: 1581.75673912
- What you said:

  > The original approach samples from the full list of genes in each bin, then restricts the sample to valid ones. Your approach samples from the valid genes in each bin.

  So if a bin e.g. contains mostly invalid genes, the original code adds only a few genes for that bin, while yours adds the maximum possible number.

  So the question is: is the sampling bias introduced in the original code wanted? If not, you not only made the code more resilient, but also more objective.
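The ordering difference between the two deduplication calls is easy to demonstrate:

```python
import numpy as np
import pandas as pd

genes = ["TNF", "ACTB", "GAPDH", "ACTB", "TNF"]

# np.unique sorts the values ...
print(np.unique(genes))           # ['ACTB' 'GAPDH' 'TNF']

# ... while pd.Series(...).unique() keeps first-appearance order.
print(pd.Series(genes).unique())  # ['TNF' 'ACTB' 'GAPDH']
```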
To answer the following question:

> So if a bin e.g. contains mostly invalid genes, the original code adds only a few genes for that bin, while yours adds the maximum possible number.
>
> So the question is: is the sampling bias introduced in the original code wanted? If not, you not only made the code more resilient, but also more objective.
After going through the original code from Seurat, it seems to me that there is no equivalent to removing the genes to be scored from the control gene set. From what I can tell, if one of the genes to be scored happens to be chosen as the background, it will be included in the calculation. But please correct me if that's not the case.
So if the original implementation does not remove score genes from the control gene set, we would simply need to remove the following line: https://github.com/scverse/scanpy/blob/ec4457470618efd85da3c7b29f951cab01a49e3a/scanpy/tools/_score_genes.py#L169
(Note: if we want to keep the current behaviour, we should still remove the line above, since it would be redundant)
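The two sampling strategies under discussion can be sketched as follows (simplified, hypothetical code for a single expression bin, not the actual scanpy implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
bin_genes = [f"g{i}" for i in range(50)]  # one expression bin
valid = set(bin_genes[:5])                # bin dominated by invalid genes
n_ctrl = 10                               # desired controls per bin

# Original strategy: sample from the full bin, then keep only valid genes.
# A bin with few valid genes contributes few (possibly zero) controls.
drawn = rng.choice(bin_genes, size=n_ctrl, replace=False)
sample_then_filter = [g for g in drawn if g in valid]

# Proposed strategy: restrict to valid genes first, then sample.
# The bin contributes as many controls as it can.
pool = [g for g in bin_genes if g in valid]
filter_then_sample = rng.choice(pool, size=min(n_ctrl, len(pool)), replace=False)

print(len(sample_then_filter), len(filter_then_sample))
```

With the bin above, the second strategy always yields 5 controls, while the first yields at most 5 and often fewer, which is the sampling bias discussed.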
About question 1: I also find it strange that the order of the cuts we iterate through should have such a large effect on the sampling. Should I change it back to `np.unique`?