cooler icon indicating copy to clipboard operation
cooler copied to clipboard

Unclear Cooler.pixels().fetch behavior

Open StefanoCretti opened this issue 1 year ago • 5 comments

Just wanted to report a behavior which I assume is intentional but might be misleading (or at least it was for me).

When using Cooler.pixels().fetch("<some_genomic_region>"), an instance of the class RangeSelector1D is returned. This selector returns all the pixels where the first bin id belongs to the provided interval. As an example, using the yeast.10kb.cool file in the tests/data folder:

from cooler import Cooler

handle = Cooler("yeast.10kb.cool")
print(handle.pixels().fetch("chrII"))

returns:

        bin1_id  bin2_id  count
36311        33       33   2433
36312        33       34   7410
36313        33       35   3444
36314        33       36    475
36315        33       37   1741
...         ...      ...    ...
128531      114     1209      1
128532      114     1221      2
128533      114     1223      2
128534      114     1224      2
128535      114     1225     12

[92225 rows x 3 columns]

which checks out since:

print(handle.extent("chrII"))  # results in (33, 115)

The issue (in my opinion) is that it does not return all pixels with at least one bin in the region. There can be pixels where the second bin belongs to the region and yet they are not returned.

print(handle.pixels()[:].query("bin2_id >= 33 and bin2_id < 115 and bin1_id < 33"))

returns:

       bin1_id  bin2_id  count
28           0       33      3
29           0       34     41
30           0       35     48
31           0       36      9
32           0       37     61
...        ...      ...    ...
36110       32       65      1
36111       32       75      1
36112       32       90      2
36113       32      101      1
36114       32      113      1

[2460 rows x 3 columns]

I understand why this might be the case. There is no easy way of indexing the bin2_id column, therefore finding these pixels would require iterating and filtering all pixels prior to the retrieved slice. This would clash with the lazy fetching and most likely be slow.

Understanding why this behavior might not change, I believe it should be clearly stated in the documentation (I did not find anything about it but I might have missed it), since one could get misleading results. I would be happy to open a PR to either change or document the behavior if needed.

StefanoCretti avatar Jan 26 '25 18:01 StefanoCretti

I think this behaviour is intended and working as documented.

sel.fetch(region1, region2) […] If region2 is not provided, it is taken to be the same as region1. That means that sel.fetch('chr2:10M-20M') returns the same result as sel.fetch('chr2:10M-20M', 'chr2:10M-20M').

kloetzl avatar Feb 10 '25 12:02 kloetzl

Hey @kloetzl, what you linked is the documentation for matrix().fetch(), not pixels().fetch(). pixels().fetch() does not allow for a second region parameter.

import cooler

handle = cooler.Cooler("test_files/yeast.10kb.cool")
print(handle.pixels().fetch("chrII", "chrII"))

raises TypeError: Cooler.pixels.<locals>._fetch() takes 1 positional argument but 2 were given

StefanoCretti avatar Feb 10 '25 12:02 StefanoCretti

Well caught, but I think the same logic still applies here. .fetch('chrII') returns the subset where all entries are on chrII in both dimensions. So the behaviour is consistent.

~~If you visualize it .fetch() returns a subsquare of the matrix. The alternative behaviour you propose is a cross, I believe.~~

kloetzl avatar Feb 10 '25 13:02 kloetzl

Summoning @nvictus

golobor avatar Feb 10 '25 13:02 golobor

Any updates?

dyinboisry4u avatar Jul 07 '25 12:07 dyinboisry4u