get_pseudobulk losing obs columns with NA values

Open emdann opened this issue 1 year ago • 1 comments

Hi, it appears that get_pseudobulk drops .obs columns that contain NAs, even when their values are unique within each sample ID.

Here's an example.

import numpy as np
import pertpy
import decoupler
adata = pertpy.dt.distance_example()
adata.X.data = np.round(adata.X.data) # doing this just for illustration purposes

In this dataset, adata.obs['target'] is NA wherever adata.obs['perturbation'] == 'control'. Even though all the control cells share the same NA value in target, I lose this column when pseudobulking:

> pdata = decoupler.get_pseudobulk(adata, sample_col = 'perturbation', groups_col=None)
> 'target' in pdata.obs
False

But the column is kept when I replace the NAs with a placeholder value:

> adata.obs['target'] = np.where(adata.obs['perturbation'] == 'control', 'no-target', adata.obs['target'])
> pdata = decoupler.get_pseudobulk(adata, sample_col = 'perturbation', groups_col=None)
> 'target' in pdata.obs
True

I would expect the function to keep the target column, NAs included, in this case.

Decoupler version: '1.8.0'

Sep 04 '24 22:09 emdann

Hi @emdann,

Oops! I recently refactored this function and introduced the bug. It should now be fixed by using .nunique(dropna=False) in commit 0dd3da67e681c74e7771b78dc53227370590f23d.
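For anyone curious why dropna matters here, below is a minimal pandas-only sketch. It does not reproduce decoupler's internals: the toy obs table and the "keep a column only if every sample has exactly one unique value" rule are assumptions made for illustration. It only shows how the default nunique() and nunique(dropna=False) differ on a group whose values are all NA.

```python
import numpy as np
import pandas as pd

# Toy per-cell metadata (hypothetical data): every control cell has a missing target.
obs = pd.DataFrame({
    "perturbation": ["control", "control", "geneA", "geneA"],
    "target": [np.nan, np.nan, "A", "A"],
})

grouped = obs.groupby("perturbation")["target"]

# Default behaviour: NaN is ignored, so the control group has 0 unique values.
n_default = grouped.nunique()

# With dropna=False, NaN counts as a value, so every group has exactly 1.
n_with_na = grouped.nunique(dropna=False)

# Assumed retention rule (illustrative only): keep the column when each
# sample has exactly one unique value.
keep_default = bool((n_default == 1).all())
keep_with_na = bool((n_with_na == 1).all())

print(n_default.to_dict(), keep_default)
print(n_with_na.to_dict(), keep_with_na)
```

Under that assumed rule, the all-NA control group fails the check with the default setting and passes it with dropna=False, which matches the behaviour reported above.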

Thanks for noticing and reporting it! You can install the latest version from GitHub to try it out. Let me know if anything else breaks.

Sep 10 '24 09:09 PauBadiaM