Bugfix: Gene score edge case where gene_list gene is chosen as control gene
In some edge cases, the control gene selection retrieves the same gene(s) that are also in the `gene_list` used for scoring. When the following line is then called, we end up with an empty control gene set, causing the downstream error in #2153: https://github.com/scverse/scanpy/blob/383a61b2db0c45ba622f231f01d0e7546d99566b/scanpy/tools/_score_genes.py#L173
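For context, here is a simplified, hypothetical sketch of the failure mode (not the actual scanpy implementation, just the set logic that produces an empty control set):

```python
# Simplified sketch: if every sampled control gene happens to also be in
# gene_list, subtracting gene_list afterwards leaves an empty control set.
def pick_controls(sampled: set[str], gene_list: set[str]) -> set[str]:
    # The problematic step: remove the genes being scored from the controls.
    return sampled - gene_list

# Worst-case draw: the sampler returned only gene_list members.
controls = pick_controls({"CD3D", "CD3E"}, {"CD3D", "CD3E"})
print(len(controls))  # 0 -> averaging over an empty control set fails downstream
```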
- [x] Closes #2153
- [x] Tests included
- [ ] Release notes not necessary because:
Codecov Report
Attention: Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.
Project coverage is 76.50%. Comparing base (4b090c0) to head (97e4024).
Additional details and impacted files
    @@           Coverage Diff           @@
    ##             main    #2875   +/-  ##
    =======================================
      Coverage   76.50%   76.50%
    =======================================
      Files         109      109
      Lines       12474    12485    +11
    =======================================
    + Hits         9543     9552     +9
    - Misses       2931     2933     +2
| Files | Coverage Δ | |
|---|---|---|
| src/scanpy/preprocessing/_simple.py | 87.95% <ø> (ø) | |
| src/scanpy/tools/_score_genes.py | 85.10% <86.66%> (-0.44%) | :arrow_down: |
Tests are failing, I believe, because the `gene_list` genes are now removed from the control genes before random sampling rather than after, resulting in a different control gene set. I'm not quite sure why the difference in scores is so large, though.
@flying-sheep, I have never been too familiar with the score_genes code. Could you take a look at this?
There are two differences:

- `np.unique(...)` returns the values sorted, while `pd.Series(...).unique()` returns them in their original order (this alone already makes the scores not match). This probably changes the sampling, but I wonder why the score difference is so large here! With only that change, I get:

      Arrays are not equal
      Mismatched elements: 2730 / 2730 (100%)
      Max absolute difference: 0.22674037
      Max relative difference: 1581.75673912
- What you said:

  > The original approach samples from the full list of genes in each bin, then restricts the sample to valid ones. Your approach samples from the valid genes in each bin.

  So if a bin e.g. contains mostly invalid genes, the original code adds only a few genes for that bin, while yours adds the maximum possible number.

  So the question is: is the sampling bias introduced in the original code wanted? If not, you not only made the code more resilient, but also more objective.
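The ordering difference between the two deduplication calls is easy to demonstrate:

```python
import numpy as np
import pandas as pd

genes = ["TNF", "ACTB", "GAPDH", "ACTB", "TNF"]

# np.unique sorts the values ...
print(np.unique(genes))           # ['ACTB' 'GAPDH' 'TNF']

# ... while pd.Series(...).unique() keeps first-appearance order.
print(pd.Series(genes).unique())  # ['TNF' 'ACTB' 'GAPDH']
```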
To answer the following question:

> So if a bin e.g. contains mostly invalid genes, the original code adds only a few genes for that bin, while yours adds the maximum possible number.
>
> So the question is: is the sampling bias introduced in the original code wanted? If not, you not only made the code more resilient, but also more objective.
After going through the original code from Seurat, it seems to me that there is no equivalent to removing the genes to be scored from the control gene set. From what I can tell, if one of the genes to be scored happens to be chosen as the background, it will be included in the calculation. But please correct me if that's not the case.
So if the original implementation does not remove score genes from the control gene set, we would simply need to remove the following line: https://github.com/scverse/scanpy/blob/ec4457470618efd85da3c7b29f951cab01a49e3a/scanpy/tools/_score_genes.py#L169
(Note: if we want to keep the current behaviour, we should still remove the line above, since it would be redundant)
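The two sampling strategies under discussion can be sketched as follows (simplified, hypothetical code for a single expression bin, not the actual scanpy implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
bin_genes = [f"g{i}" for i in range(50)]  # one expression bin
valid = set(bin_genes[:5])                # bin dominated by invalid genes
n_ctrl = 10                               # desired controls per bin

# Original strategy: sample from the full bin, then keep only valid genes.
# A bin with few valid genes contributes few (possibly zero) controls.
drawn = rng.choice(bin_genes, size=n_ctrl, replace=False)
sample_then_filter = [g for g in drawn if g in valid]

# Proposed strategy: restrict to valid genes first, then sample.
# The bin contributes as many controls as it can.
pool = [g for g in bin_genes if g in valid]
filter_then_sample = rng.choice(pool, size=min(n_ctrl, len(pool)), replace=False)

print(len(sample_then_filter), len(filter_then_sample))
```

With the bin above, the second strategy always yields 5 controls, while the first yields at most 5 and often fewer, which is the sampling bias discussed.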
About question 1: I also find it strange that the order of the cuts we iterate through should have such a large effect on the sampling. Should I change it back to `np.unique`?