cooltools

just a few suggestions

Open dmalzl opened this issue 5 years ago • 5 comments

Hi,

After doing Hi-C stuff for a year now, I think I finally got some grip on how to do things, and since I am a Python person I very much appreciate the work around the cooler universe, especially cooltools for downstream analysis. Despite its usefulness, I came across some shortcomings while working with the suite, especially when it comes to using coolers that contain more than one set of balancing weights:

  1. Cooler automatically handles KR balancing weights as divisive weights, since it expects all KR weights to originate from the Juicer KR implementation. However, the deeptools universe has added a C++ KR implementation as a Python extension for their HiCExplorer package, which returns multiplicative weights. Using the cooltools functions and passing KR as the balancing column now wrongly treats these weights as divisive, which results in nonsense matrices. Fortunately, this can be circumvented by setting the divisive_weights argument to False (divisive_weights = False). I would therefore suggest exposing this argument in the function interfaces where it is necessary, to avoid wrong results.

  2. Coolers that contain more than one set of weights are a problem for some of the cooltools functions. In particular, there are situations where you want to have, e.g., both KR and IC balancing results. Naming the respective vectors KR and IC in the cooler conflicts with some of the cooltools functions, because cooltools tries to infer low-coverage bins from the weight vector, and the column name is hardcoded as

is_bad_bin = np.isnan(clr.bins().fetch(chrom)["weight"].values)

If this column is not present in the cooler, the function simply crashes with a KeyError. Therefore, I would suggest adding a function parameter that lets the user specify which weight column to use for bad bin inference.
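A minimal sketch of what the second suggestion could look like, assuming the bins table is available as a pandas DataFrame (as returned by clr.bins().fetch(chrom)). The helper name `infer_bad_bins` and its `weight_name` parameter are hypothetical and not part of cooltools; the toy table stands in for a real cooler.

```python
import numpy as np
import pandas as pd


def infer_bad_bins(bins, weight_name="weight"):
    """Return a boolean mask of bins whose balancing weight is NaN.

    Hypothetical helper: `bins` is a bins table (pandas DataFrame) and
    `weight_name` selects which weight column marks the bad bins.
    """
    if weight_name not in bins.columns:
        # Fail with an explicit message instead of a bare KeyError.
        raise ValueError(
            f"Weight column {weight_name!r} not found; "
            f"available columns: {list(bins.columns)}"
        )
    return np.isnan(bins[weight_name].to_numpy(dtype=float))


# Toy bins table with two sets of weights and no default "weight" column.
bins = pd.DataFrame(
    {
        "chrom": ["chr1"] * 4,
        "start": [0, 1000, 2000, 3000],
        "end": [1000, 2000, 3000, 4000],
        "KR": [1.1, np.nan, 0.9, 1.0],
        "IC": [0.8, np.nan, np.nan, 1.2],
    }
)

bad_kr = infer_bad_bins(bins, weight_name="KR")
bad_ic = infer_bad_bins(bins, weight_name="IC")
```

With such a parameter, the same cooler can carry several weight vectors and each analysis picks the one it was balanced with.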

Thank you for providing this toolset. Best, Daniel

dmalzl avatar Jul 08 '20 10:07 dmalzl

Hi Daniel,

Thanks for the suggestions! We will keep that in mind. We're now in the process of doing a major revision of cooltools, bioframe, and cooler. We talked about being able to provide weights to coolers externally, and this is in the plans. We will try to expose bad bins and weights in cooltools when possible too.

We rarely use KR weights ourselves, which explains why we never encountered this problem. I believe it is there for compatibility with coolers created from .hic files.

Generally, I find it very misleading that people talk about KR weights and IC weights as if they are different things. Technically KR is an algorithm that achieves exactly the same result as IC - a balanced matrix. In practice, neither IC nor KR converges nicely on Hi-C data due to the presence of sparse rows/columns (and if they do, pixels in sparse areas get super high values that confuse many algorithms). So filtering is needed, and both cooler and "KR" software do it. Also, there is a debate about what subset of data to use (cis/everything; remove or keep first 2 diagonals). So the differences are not actually between KR and IC - they are rather between Mirny/Dekker filtering approach vs Aiden lab filtering approach (undocumented as of 2 years ago) vs other labs.

I believe it definitely makes sense to allow for choosing the weights column. I will add it to the develop branch. We may expose divisive_weights in cooltools as well, though it may introduce more clutter. It may be easier to just invert the weights and provide them to the open cooler file externally (we are working on this feature now).
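To illustrate why inverting the weights is enough, here is a small NumPy sketch. It assumes the usual convention that multiplicative weights scale a pixel as raw * w_i * w_j, while divisive weights scale it as raw / (w_i * w_j). The function `apply_weights` is a hypothetical stand-in, not a cooler or cooltools function.

```python
import numpy as np


def apply_weights(raw, weights, divisive=False):
    """Balance a dense matrix with one weight per bin.

    Hypothetical helper: divisive weights are inverted first, then
    applied multiplicatively to rows and columns.
    """
    w = np.asarray(weights, dtype=float)
    if divisive:
        w = 1.0 / w
    return raw * np.outer(w, w)


raw = np.array([[4.0, 2.0], [2.0, 8.0]])
divisive = np.array([2.0, 4.0])

# Inverting divisive weights gives the equivalent multiplicative weights.
multiplicative = 1.0 / divisive

balanced_div = apply_weights(raw, divisive, divisive=True)
balanced_mul = apply_weights(raw, multiplicative, divisive=False)

# Treating multiplicative weights as divisive gives a very different matrix.
wrong = apply_weights(raw, multiplicative, divisive=True)
```

Here `balanced_div` and `balanced_mul` agree, whereas `wrong` shows the kind of mismatch described in the original report.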

Max

mimakaev avatar Jul 11 '20 16:07 mimakaev

Hi Maksim,

Thanks for the quick reply and sorry for my late one.

It is good to hear about the revision. Regarding KR and divisive_weights, this is definitely because of the hic2cool conversion, which ensures compatibility with the way the Aiden lab produces those weights. From my comparisons it is also somewhat clear that there is not much difference between KR and IC in terms of balancing. I just stuck with the HiCExplorer KR because I started my analysis with KR-balanced matrices and, as a semi-novice in the field, I did not know where else to go.

Anyway, I definitely see that the differences come from different filtering methodologies, but being able to choose the weight column would definitely be worthwhile, as I think there might be some new algorithms in the future.

So yeah thanks for considering and best regards, Daniel

dmalzl avatar Jul 17 '20 12:07 dmalzl

The clr_weight_name argument is now available in most tools! I tried to find which ones don't have it yet, based on the docs:

  • insulation
  • eigenvectors
  • snipping

Notably, in call-dots this argument is named weight-name, and should be renamed to align with the other tools.

If anyone notices any other discrepancies, please report or fix.

Phlya avatar Nov 03 '21 19:11 Phlya

(2) is addressed by @Phlya , but also should be fixed in cooltools.insulation - thank you for pointing this out!

re: (1), it's a surprisingly convoluted issue. Proposed solutions:

(a) throw a warning inside cooler every time a weight column is automatically interpreted as divisive

(b) throw an error inside the corresponding cooltools.check.is_balanced (@sergpolly ) if there is a weight column that potentially could be divisive, and suggest that the user use multiplicative (and well-filtered!!) weights. Divisive-looking column names: 4DN_DIVISIVE_WEIGHTS = {"KR", "VC", "VC_SQRT"}

(c) @nvictus : pass a kwarg dict with arguments for cooler {'weight_column', 'is_divisive'}
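A rough sketch of how options (a) and (b) could be combined in one check, using only the standard library. The function `check_weight_name`, its `on_divisive` switch, and the constant name `DIVISIVE_LOOKING_WEIGHTS` are hypothetical (a Python identifier cannot start with a digit, so the name from the list above is not reused as written); this is not the actual cooltools.check implementation.

```python
import warnings

# Column names that look like divisive weights (taken from the list above).
DIVISIVE_LOOKING_WEIGHTS = {"KR", "VC", "VC_SQRT"}


def check_weight_name(weight_name, bin_columns, on_divisive="warn"):
    """Return True if `weight_name` looks like a divisive weight column.

    Hypothetical helper: on_divisive="warn" mirrors option (a),
    on_divisive="raise" mirrors option (b).
    """
    if weight_name not in bin_columns:
        raise ValueError(f"Weight column {weight_name!r} not found in bins")
    if weight_name in DIVISIVE_LOOKING_WEIGHTS:
        msg = (
            f"Weight column {weight_name!r} may hold divisive weights; "
            "consider providing multiplicative, well-filtered weights."
        )
        if on_divisive == "raise":
            raise ValueError(msg)
        warnings.warn(msg, UserWarning, stacklevel=2)
        return True
    return False


columns = ["chrom", "start", "end", "KR", "IC"]

# Record the warning instead of printing it, to inspect it afterwards.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kr_is_suspicious = check_weight_name("KR", columns)
    ic_is_suspicious = check_weight_name("IC", columns)
```

A name-based check like this can only flag suspicious columns; it cannot tell whether a column named KR actually holds divisive or multiplicative weights, which is why option (c) passes that information explicitly.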

golobor avatar Nov 03 '21 20:11 golobor

my part is done and merged - i'm un-assigning myself for now

sergpolly avatar Nov 04 '21 19:11 sergpolly