
Bugfix: Gene score edge case where gene_list gene is chosen as control gene

Open mumichae opened this issue 2 years ago • 6 comments

In some edge cases, the control gene selection retrieves the same gene(s) that are also in the gene_list used for scoring. As a result, when the following line is called, we end up with an empty control gene set, causing the downstream error in #2153 https://github.com/scverse/scanpy/blob/383a61b2db0c45ba622f231f01d0e7546d99566b/scanpy/tools/_score_genes.py#L173
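For illustration, a minimal sketch of the failure mode (the gene names are hypothetical, not taken from the actual report):

```python
# Hypothetical sketch of the edge case: if every sampled control gene also
# appears in gene_list, removing the overlap leaves an empty control set,
# and the downstream mean over zero control genes fails.
control_genes = {"MKI67", "TOP2A"}      # sampled controls (illustrative)
gene_list = {"MKI67", "TOP2A", "PCNA"}  # genes being scored (illustrative)

control_genes -= gene_list  # the overlap-removal step
print(len(control_genes))   # 0 -> empty control set
```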

  • [x] Closes #2153
  • [x] Tests included
  • [ ] Release notes not necessary because:

mumichae avatar Feb 21 '24 16:02 mumichae

Codecov Report

Attention: Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 76.50%. Comparing base (4b090c0) to head (97e4024).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2875   +/-   ##
=======================================
  Coverage   76.50%   76.50%           
=======================================
  Files         109      109           
  Lines       12474    12485   +11     
=======================================
+ Hits         9543     9552    +9     
- Misses       2931     2933    +2     
Files Coverage Δ
src/scanpy/preprocessing/_simple.py 87.95% <ø> (ø)
src/scanpy/tools/_score_genes.py 85.10% <86.66%> (-0.44%) :arrow_down:

codecov[bot] avatar Feb 21 '24 16:02 codecov[bot]

Tests are failing, I believe, because the gene_list genes are removed from the control genes before random sampling, not after, resulting in a different control gene set. I'm not quite sure why the difference in scores is so large, though.

mumichae avatar Feb 21 '24 16:02 mumichae

@flying-sheep, I have never been too familiar with the score_genes code. Could you take a look at this?

ivirshup avatar Feb 22 '24 12:02 ivirshup

There are two differences:

  1. np.unique(...) returns the values sorted, pd.Series(...).unique() returns them in original order (this already makes the scores not match)

    This probably changes the sampling, but I wonder why the score difference is so large here! With only that change, I get:

    Arrays are not equal

    Mismatched elements: 2730 / 2730 (100%)
    Max absolute difference: 0.22674037
    Max relative difference: 1581.75673912

  2. what you said: The original approach samples from the full list of genes in each bin, then restricts the sample to valid ones. Your approach samples from the valid genes in each bin.

    So if a bin e.g. contains mostly invalid genes, the original code adds only a few genes for that bin, while yours adds the maximum possible number.

    So the question is: is the sampling bias introduced in the original code wanted? If not, you not only made the code more resilient, but also more objective.
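A quick illustration of the ordering difference in point 1 (gene names are made up):

```python
# np.unique returns its result sorted, while pd.Series(...).unique() keeps
# first-appearance order -- so any sampling that iterates over the result
# sees the genes in a different sequence.
import numpy as np
import pandas as pd

genes = np.array(["GeneC", "GeneA", "GeneC", "GeneB", "GeneA"])

print(np.unique(genes))           # sorted: ['GeneA' 'GeneB' 'GeneC']
print(pd.Series(genes).unique())  # original order: ['GeneC' 'GeneA' 'GeneB']
```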

flying-sheep avatar Feb 26 '24 10:02 flying-sheep

To answer the following question:

  2. what you said: The original approach samples from the full list of genes in each bin, then restricts the sample to valid ones. Your approach samples from the valid genes in each bin.

    So if a bin e.g. contains mostly invalid genes, the original code adds only a few genes for that bin, while yours adds the maximum possible number.

    So the question is: is the sampling bias introduced in the original code wanted? If not, you not only made the code more resilient, but also more objective.

After going through the original code from Seurat, it seems to me that there is no equivalent of removing the genes to be scored from the control gene set. From what I can tell, if one of the genes to be scored happens to be chosen as background, it is included in the calculation. But please correct me if that's not the case.

So if the original implementation does not remove score genes from the control gene set, we would simply need to remove the following line: https://github.com/scverse/scanpy/blob/ec4457470618efd85da3c7b29f951cab01a49e3a/scanpy/tools/_score_genes.py#L169

(Note: if we want to keep the current behaviour, we should still remove the line above, since it would be redundant)
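The contrast between the two sampling strategies can be sketched as follows (a simplified model, not the actual `_score_genes.py` code; all names, bin contents, and sizes are illustrative):

```python
# Simplified sketch of the two control-gene sampling strategies for one
# expression bin. "Invalid" genes are those already in gene_list.
import numpy as np

rng = np.random.default_rng(0)
bin_genes = np.array([f"g{i}" for i in range(10)])
gene_list = {"g0", "g1", "g2", "g3", "g4", "g5", "g6", "g7"}  # bin is mostly score genes
n_ctrl = 5

# Original approach: sample from the full bin, then drop score genes.
# When the bin is dominated by score genes, few (or no) controls survive.
sampled = rng.choice(bin_genes, n_ctrl, replace=False)
original_controls = [g for g in sampled if g not in gene_list]

# Fixed approach: restrict to valid genes first, then sample.
# This always yields as many controls as the bin can provide.
valid = np.array([g for g in bin_genes if g not in gene_list])
fixed_controls = rng.choice(valid, min(n_ctrl, len(valid)), replace=False)

print(len(original_controls), len(fixed_controls))
```

With this bin composition the fixed approach always returns both valid genes, while the original approach can return anywhere from zero to two, which is exactly the source of the empty-control-set edge case.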

mumichae avatar Mar 22 '24 15:03 mumichae

About question 1: I also find it strange that the order of the cuts you iterate through should have such a large effect on sampling. Should I change it back to np.unique?

mumichae avatar Apr 29 '24 14:04 mumichae