netZooPy

Panda preprocessing expression

Open michelegentili93 opened this issue 3 years ago • 7 comments

In Panda preprocessing there was a problem with indices. Using gene2idx.get(x, 0) always gives you index 0 if x is missing from gene2idx (for example, a gene that is in the gene expression data but not in the motif, since gene2idx is built on the intersection of expression and motif). Now we use gene_names both to create the indices for self.expression and to access the expression data frame self.expression_data with .loc[].
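
For illustration, here is a minimal sketch of the pitfall and the fix. It is not the netZooPy code: the gene names, the values, and the variables `expression`, `motif_genes`, `buggy_out`, and `fixed_values` are hypothetical toy data. Only `gene2idx` and `gene_names` echo names from the description above.

```python
# Toy data; every name and value here is hypothetical, not taken from netZooPy.
expression = {"G1": 1.0, "G2": 2.0, "G3": 3.0}  # gene -> expression value
motif_genes = ["G1", "G3"]                      # genes present in the motif

# Index map built on the intersection of expression and motif genes.
gene_names = sorted(set(expression) & set(motif_genes))
gene2idx = {gene: i for i, gene in enumerate(gene_names)}

# Buggy lookup: the default of 0 silently maps the missing gene "G2" to index 0.
buggy_idx = [gene2idx.get(gene, 0) for gene in expression]

# Writing values through those indices lets "G2" overwrite the slot of "G1".
buggy_out = [None] * len(gene_names)
for gene, idx in zip(expression, buggy_idx):
    buggy_out[idx] = expression[gene]

# Fix: iterate only over the genes in the intersection, so no default is needed.
fixed_idx = [gene2idx[gene] for gene in gene_names]
fixed_values = [expression[gene] for gene in gene_names]

print(buggy_idx, buggy_out)
print(fixed_idx, fixed_values)
```

In this toy case the slot for the first gene ends up holding the value of a gene that should have been discarded, while the restricted lookup keeps only the two shared genes.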

michelegentili93 avatar Oct 24 '22 20:10 michelegentili93

Hi @michelegentili93, thanks! I just rebased the PR onto the devel branch.

marouenbg avatar Oct 25 '22 00:10 marouenbg

Thanks @michelegentili93, that's a great catch. So this affects cases where genes are in expression but not in motif, and it sets them all to the expression of the first gene.

marouenbg avatar Oct 25 '22 17:10 marouenbg

Correct :) Thank you for the Python implementation and maintenance!

michelegentili93 avatar Oct 25 '22 17:10 michelegentili93

@michelegentili93 How did you find out about this bug, did you get an error while running panda?

marouenbg avatar Oct 27 '22 16:10 marouenbg

> @michelegentili93 How did you find out about this bug, did you get an error while running panda?

I was running PUMA giving the df_correlation_matrix as input. And I noticed the values weren't the same.

michelegentili93 avatar Oct 27 '22 18:10 michelegentili93

Codecov Report

Base: 54.50% // Head: 54.74% // Increases project coverage by +0.23% :tada:

Coverage data is based on head (4a86b8d) compared to base (793d88f). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##            devel     #275      +/-   ##
==========================================
+ Coverage   54.50%   54.74%   +0.23%     
==========================================
  Files          37       37              
  Lines        2343     2351       +8     
==========================================
+ Hits         1277     1287      +10     
+ Misses       1066     1064       -2     
Impacted Files Coverage Δ
netZooPy/panda/panda.py 76.04% <100.00%> (+1.39%) :arrow_up:

codecov[bot] avatar Oct 27 '22 22:10 codecov[bot]

@violafanfani this is good to go. When aligning TFs or genes using processing mode 'intersection', if gene expression contains a nonoverlapping set of genes, these genes get assigned to index 0 instead of being discarded. I fixed it by simply restricting the indices to those of the intersection. In MATLAB, this is equivalent to https://github.com/netZoo/netZooM/blob/master/netZooM/tools/processData.m#L120

This affects the first gene and first TF of PANDA and PUMA networks when gene expression has genes not present in the motif and when they're run with 'intersection'. I've also added new MATLAB ground truth results, and everything passes to 12 decimal digits of relative tolerance.
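
As a rough illustration of what a relative-tolerance check of that kind can look like, here is a standard-library sketch. It assumes "12 decimal digits" corresponds to `rel_tol=1e-12`; the helper `all_close` and the sample values are hypothetical and are not the project's actual test code.

```python
import math


def all_close(result, reference, rel_tol=1e-12):
    """Return True if both sequences have equal length and every pair
    of values agrees within the given relative tolerance."""
    return len(result) == len(reference) and all(
        math.isclose(a, b, rel_tol=rel_tol) for a, b in zip(result, reference)
    )


reference = [0.25, -1.5, 3.0]               # hypothetical ground-truth values
result = [0.25, -1.5, 3.0 * (1 + 1e-14)]    # differs at roughly the 1e-14 level
perturbed = [0.25, -1.5, 3.0 * (1 + 1e-9)]  # differs at roughly the 1e-9 level

print(all_close(result, reference))
print(all_close(perturbed, reference))
```

The first comparison is within the assumed tolerance and the second is not, which is the kind of difference a ground-truth test at this precision is meant to catch.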

Please make this a separate release after you release the GPU fix.

marouenbg avatar Oct 28 '22 00:10 marouenbg

Ok, great job! Thanks for helping on this.

violafanfani avatar Oct 28 '22 13:10 violafanfani