decoupler-py

get_pseudobulk losing obs columns with NA values

Open. emdann opened this issue 1 year ago • 1 comment

Hi, it appears that get_pseudobulk drops .obs columns that contain NAs, even when each sample ID has a single unique value in that column.

Here's an example.

import numpy as np
import pertpy
import decoupler
adata = pertpy.dt.distance_example()
adata.X.data = np.round(adata.X.data) # doing this just for illustration purposes

In this dataset, adata.obs['target'] is NA wherever adata.obs['perturbation'] == 'control'. Even though all the control cells share the same NA value in target, the column is lost when pseudobulking:

> pdata = decoupler.get_pseudobulk(adata, sample_col = 'perturbation', groups_col=None)
> 'target' in pdata.obs
False

but the column is kept when I substitute the NAs:

> adata.obs['target'] = np.where(adata.obs['perturbation'] == 'control', 'no-target', adata.obs['target'])
> pdata = decoupler.get_pseudobulk(adata, sample_col = 'perturbation', groups_col=None)
> 'target' in pdata.obs
True

I would expect the function to keep the target column, NAs included, in this case.

Decoupler version: '1.8.0'

emdann, Sep 04 '24 22:09

Hi @emdann,

Oops! I recently refactored this function and introduced the bug. It should now be fixed by using .nunique(dropna=False), in commit 0dd3da67e681c74e7771b78dc53227370590f23d.
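For readers wondering why dropna matters here, the following is a minimal pandas sketch, not decoupler's actual implementation. It assumes the column-retention check amounts to "exactly one unique value per sample"; the toy obs table and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for adata.obs: control cells have NA in 'target'
# (hypothetical data, for illustration only).
obs = pd.DataFrame({
    "perturbation": ["control", "control", "geneA", "geneA"],
    "target": [np.nan, np.nan, "geneA", "geneA"],
})

grouped = obs.groupby("perturbation")["target"]

# By default nunique() ignores NA, so the all-NA control group counts 0 values.
n_default = grouped.nunique()

# With dropna=False, NA is counted as a value, so the control group counts 1.
n_keep_na = grouped.nunique(dropna=False)

# Assumed retention rule: keep the column only if every sample has one value.
constant_default = bool((n_default == 1).all())   # False: column would be dropped
constant_keep_na = bool((n_keep_na == 1).all())   # True: column would be kept
```

Under that assumed rule, the default behaviour treats the control group as having zero unique values, which is consistent with the column disappearing in the report above.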

Thanks for noticing and reporting it! You can install the latest version from GitHub to try it out. Let me know if anything else breaks.

PauBadiaM, Sep 10 '24 09:09