decoupler-py icon indicating copy to clipboard operation
decoupler-py copied to clipboard

Python and R versions of package giving different results

Open shaln opened this issue 1 year ago • 1 comments

Hi, I wanted to first and foremost express my appreciation for this fantastic package!

I am trying to apply the TF activity inference functions on my 10X scRNA-seq data. For reference, my dataset has 4 conditions and each condition has 2 timepoints (total 10 samples).

Prior to decoupler, I had processed the scRNA-seq data using Seurat v5 following the SCTransform v2 tutorial, and applied decoupler to Seurat objects obtained after running the PrepSCTFindMarkers function in Seurat.

For my first run of decoupler, I tried running the R version on a Seurat object consisting of only one of the conditions with 2 timepoints - for clarity, let's call this Condition 1, which has D7 and D9. This Seurat object was processed the same way as described above. I obtained the heatmap in Figure A below, of the top 3 TF.

Since the Python implementation is more memory efficient and offers the function to run pseudobulk analysis downstream, which is what I want to do for the Seurat object containing all samples and conditions, I decided to try running the Python version. I ran the Python version on the exact same Seurat object for Condition 1 that I had used for the R version (converted to .h5ad format using the sceasy package). I was surprised to see the Python and R versions giving different results for the same dataset, as shown in the heatmap below the top 3 TFs for each cluster is different between the Python (Figure B) and R versions of decoupler (Figure A).

Picture 1

Is this a bug or expected? I'm not quite sure what could be causing the discrepancy apart from something in the Seurat object to anndata conversion process or differences underneath the hood between the R and Python versions.. I'd appreciate it if you could help clarify this :) Thanks!

shaln avatar Aug 26 '24 15:08 shaln

Hi @shaln,

Sorry for the delayed reply, I was on holidays. The observed discrepancies are at how the vignettes present the results but the activity values should be the same (I am assuming that you are comparing sc vs sc, and not sc vs pseudobulk). In the R vignette we plot the top 25 more variable TFs at the activity level, while in the python version we plot the most active TFs per cluster label. If you look at OLIG2, you can see that it has the same distribution of values in both plots. Personally I prefer the python one since you get a clearer picture per cluster (in the R one it might be the case that variable TFs are over-represented in a subset of clusters). Another source of variability could be that during the conversion some genes are removed, this will affect the background of enrichment methods and results could change slightly.

Hope this is helpful! Let me know if you have further questions.

PauBadiaM avatar Sep 10 '24 08:09 PauBadiaM

Thanks @PauBadiaM for the clear explanation! It is very helpful indeed :)

I have a separate but relevant question. I mentioned in my original post that I processed the scRNA-seq data using Seurat v5 following the SCTransform v2 tutorial, and applied decoupler to Seurat objects obtained after running the PrepSCTFindMarkers function in Seurat. My understanding is that PrepSCTFindMarkers recalculates the corrected counts by reversing the regressing model by using minimum of median sequencing depth. Would it thus be more appropriate to use the data before OR after running PrepSCTFindMarkers?

I tried to read up what decoupler-py does under the hood but I'm still trying to wrap my head round how PrepSCTFindMarkers would/wouldn't affect the results. I have also tried running decoupler-py on data with and without PrepSCTFindMarkers, and I do get rather different results.

Top 3 TFs per cluster (without PrepSCTFindMarkers)

{'0': ['CEBPA', 'JUN', 'KAT7'], '1': ['LMO2', 'MEF2C', 'GFI1B'], '2': ['KLF9', 'GBX2', 'ELK3'], '3': ['GCM1', 'VHL', 'SP4'], '4': ['E2F4', 'E2F1', 'ZNF143'], '5': ['KDM5A', 'FOXN1', 'KAT5'], '6': ['ZNF148', 'PAWR', 'HIPK2'], '7': ['HDAC9', 'HIF3A', 'HDAC7']}

Top 3 TFs per cluster (with PrepSCTFindMarkers)

{'0': ['SMAD4', 'SMAD3', 'SMAD2'], '1': ['MEF2C', 'LMO2', 'GFI1B'], '2': ['MSX2', 'NFYB', 'ZNF91'], '3': ['SP4', 'GCM1', 'VHL'], '4': ['E2F4', 'E2F1', 'E2F3'], '5': ['TFDP1', 'E2F4', 'OLIG2'], '6': ['KLF8', 'TAL1', 'ZNF148'], '7': ['ATOH7', 'IRX4', 'HDAC9']}

Given that Seurat requires PrepSCTFindMarkers to be run prior to FindMarkers or any differential gene expression analysis, logic follows that the data after running PrepSCTFindMarkers would be the one to feed into the decoupler package. However, I wanted check here to make sure I'm feeding in the appropriate data.

Thank you :) I'm really enjoying using the package thus far!

shaln avatar Oct 22 '24 11:10 shaln

Hi @shaln,

Again, sorry for the delayed response. Regarding your question, I am not sure actually, unfortunately I am not familiar with the new Seurat normalization. If you want to play it safe, use classic log-normalized gene expression, or contrast statistics from DEG (but not from marker gene extraction). Sorry!

PauBadiaM avatar Nov 29 '24 13:11 PauBadiaM