
[BUG] unexpected changing of data type in dataframe

Open whoisleehom opened this issue 2 years ago • 5 comments

Describe the bug

prune2df returns a DataFrame in which the TargetGenes column has an unexpected data type (string[pyarrow] instead of object).

Steps to reproduce the behavior

  1. Command run when the error occurred:
df = prune2df(dbs,modules,'./motif_annotation.polish.tbl',auc_threshold=0.02,num_workers=100)
  2. The command reported no error. However, the data type of the TargetGenes column became string[pyarrow], as shown below:
Enrichment  AUC                              float64
            NES                              float64
            MotifSimilarityQvalue            float64
            OrthologousIdentity              float64
            Annotation               string[pyarrow]
            Context                  string[pyarrow]
            TargetGenes              string[pyarrow]
            RankAtMax                          int64
dtype: object

This led to incorrect results from df2regulons. The output of df2regulons(df) became:


[Regulon(name='Ger_000039(+)', gene2weight=frozendict.frozendict({'[': 1.0, '(': 1.0, "'": 1.0, 'G': 1.0, 'e': 1.0, 'r': 1.0, '_': 1.0, '0': 1.0, '3': 1.0, '9': 1.0, '8': 1.0, ',': 1.0, ' ': 1.0, '1': 1.0, '.': 1.0, '6': 1.0, '7': 1.0, '5': 1.0, '2': 1.0, '4': 1.0, ')': 1.0, ']': 1.0}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Ger_000039', context=frozenset({'activating'}), score=207.67981294849346, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
 Regulon(name='Ger_002027(+)', gene2weight=frozendict.frozendict({'[': 1.0, '(': 1.0, "'": 1.0, 'G': 1.0, 'e': 1.0, 'r': 1.0, '_': 1.0, '0': 1.0, '1': 1.0, '5': 1.0, '4': 1.0, ',': 1.0, ' ': 1.0, '.': 1.0, '2': 1.0, '3': 1.0, '7': 1.0, '8': 1.0, '9': 1.0, ')': 1.0, ']': 1.0}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Ger_002027', context=frozenset({'activating'}), score=85.88838629001951, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
 Regulon(name='Ger_002712(+)', gene2weight=frozendict.frozendict({'[': 1.0, '(': 1.0, "'": 1.0, 'G': 1.0, 'e': 1.0, 'r': 1.0, '_': 1.0, '0': 1.0, '3': 1.0, '5': 1.0, '9': 1.0, '8': 1.0, '7': 1.0, ',': 1.0, ' ': 1.0, '.': 1.0, '2': 1.0, '4': 1.0, '1': 1.0, ')': 1.0, '6': 1.0, ']': 1.0}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Ger_002712', context=frozenset({'activating'}), score=129.436625254587, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
 Regulon(name='Ger_003246(+)', gene2weight=frozendict.frozendict({'[': 1.0, '(': 1.0, "'": 1.0, 'G': 1.0, 'e': 1.0, 'r': 1.0, '_': 1.0, '0': 1.0, '4': 1.0, '5': 1.0, '8': 1.0, '2': 1.0, ',': 1.0, ' ': 1.0, '1': 1.0, '.': 1.0, '9': 1.0, '3': 1.0, '6': 1.0, ')': 1.0, '7': 1.0, ']': 1.0}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Ger_003246', context=frozenset({'activating'}), score=188.21663463639973, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation='')]
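
The single-character keys in gene2weight above are what one would expect if the TargetGenes value is treated as a plain string and iterated character by character. The following is a minimal sketch of that effect in plain Python, not pySCENIC's actual code; the gene names and weights are hypothetical, and the assumption is that each TargetGenes entry is the text form of a list of (gene, weight) tuples.

```python
import ast

# Hypothetical value: the text form of a list of (gene, weight) tuples,
# similar in shape to what the TargetGenes column held in this report.
target_genes_as_text = "[('Ger_000398', 1.5), ('Ger_000672', 2.25)]"

# Iterating over a string yields its characters, so a gene -> weight
# mapping built from it ends up with one-character "genes".
broken = {ch: 1.0 for ch in target_genes_as_text}

# Parsing the text first recovers the real list of tuples.
parsed = ast.literal_eval(target_genes_as_text)
fixed = dict(parsed)

print(sorted(broken))
print(fixed)
```

The first print lists only punctuation, digits and letters, matching the pattern in the Regulon output, while the second shows the intended gene-to-weight mapping.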

Expected behavior

The data type of the DataFrame's TargetGenes column should be object, which would avoid this kind of error. However, I cannot figure out what in the prune2df function caused the change.

Here is the information of the installation:

- pySCENIC version: 0.12.1+6.g31d51a1
- Installation method: pip
- Run environment: VS Code
- OS: Ubuntu

whoisleehom • Sep 21 '23 13:09

I solved the problem with the following code, which parses each TargetGenes string back into a list:

import ast
import pandas as pd

# TargetGenes arrived as text; parse each entry back into a list of (gene, weight) tuples.
col = ('Enrichment', 'TargetGenes')
parsed = [ast.literal_eval(s) for s in df[col]]
# Wrap in an object-dtype Series so each list is stored as a single cell.
df[col] = pd.Series(parsed, index=df.index, dtype=object)

whoisleehom • Oct 02 '23 13:10

Hello,

I'm trying to figure out if I have the same problem. For some reason my aucell matrix is all 0s with a majority of genes not mapping properly. My df2regulons output looks like this:

[Regulon(name='Alx1(+)', gene2weight=frozendict.frozendict({'[': 1.0, '(': 1.0, "'": 1.0, 'T': 1.0, 'i': 1.0, 'm': 1.0, 'p': 1.0, '2': 1.0, ',': 1.0, ' ': 1.0, '.': 1.0, '3': 1.0, '7': 1.0, '1': 1.0, '0': 1.0, '6': 1.0, '5': 1.0, '8': 1.0, ')': 1.0, 'D': 1.0, 'a': 1.0, '4': 1.0, '9': 1.0, 'C': 1.0, 'o': 1.0, 'z': 1.0, 'S': 1.0, 'e': 1.0, 'r': 1.0, 'n': 1.0, 'h': 1.0, 'R': 1.0, 'c': 1.0, ']': 1.0, 'N': 1.0, 't': 1.0, 'g': 1.0, 'E': 1.0, 'l': 1.0, 'd': 1.0, 'F': 1.0, 'b': 1.0, 'A': 1.0, 'x': 1.0, 's': 1.0, 'B': 1.0, 'L': 1.0, 'I': 1.0, 'f': 1.0, 'P': 1.0, 'k': 1.0, 'M': 1.0, 'G': 1.0, 'V': 1.0, 'q': 1.0}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Alx1', context=frozenset({'metacluster_9.9.png', 'activating'}), score=4.036787272144878, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
 Regulon(name='Alx3(+)', gene2weight=frozendict.frozendict({'[': 1.0, '(': 1.0, "'": 1.0, 'P': 1.0, 'l': 1.0, 's': 1.0, 'c': 1.0, 'r': 1.0, '4': 1.0, ',': 1.0, ' ': 1.0, '0': 1.0, '.': 1.0, '7': 1.0, '9': 1.0, '6': 1.0, '8': 1.0, '5': 1.0, '1': 1.0, '2': 1.0, '3': 1.0, ')': 1.0, 'C': 1.0, 'p': 1.0, 'a': 1.0, 'I': 1.0, 'n': 1.0, 'M': 1.0, 'e': 1.0, 't': 1.0, 'F': 1.0, 'o': 1.0, 'x': 1.0, 'd': 1.0, 'R': 1.0, 'D': 1.0, 'i': 1.0, 'L': 1.0, 'T': 1.0, 'b': 1.0, 'E': 1.0, 'h': 1.0, 'N': 1.0, 'g': 1.0, 'f': 1.0, 'm': 1.0, 'K': 1.0, ']': 1.0, 'v': 1.0, 'S': 1.0, 'A': 1.0, 'k': 1.0, 'O': 1.0, 'G': 1.0, 'j': 1.0, 'U': 1.0, 'J': 1.0, 'W': 1.0, 'V': 1.0, 'w': 1.0, 'B': 1.0, 'H': 1.0, 'u': 1.0, 'z': 1.0}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Alx3', context=frozenset({'metacluster_9.26.png', 'activating'}), score=3.343534473214638, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),

Is there a reference for what the expected output looks like? How are you assessing how many genes per regulon you have in your data? Thanks in advance.
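
One way to check for this symptom is to count the genes per regulon and flag any regulon whose "genes" are all single characters. The sketch below works on plain dictionaries rather than real Regulon objects; the helper names, gene names and weights are hypothetical and only illustrate the check.

```python
def looks_like_split_string(gene2weight):
    """Return True if every key is a single character (the symptom seen above)."""
    keys = list(gene2weight)
    return bool(keys) and all(isinstance(k, str) and len(k) == 1 for k in keys)


def summarize(regulons):
    """Map each regulon name to (number of genes, suspicious flag)."""
    return {
        name: (len(g2w), looks_like_split_string(g2w))
        for name, g2w in regulons.items()
    }


# Hypothetical toy data: one regulon showing the symptom, one healthy regulon.
regulons = {
    "Alx1(+)": {"[": 1.0, "(": 1.0, "T": 1.0, "i": 1.0},
    "Alx3(+)": {"Timp2": 3.7, "Dab2": 1.9, "Col1a1": 2.4},
}

report = summarize(regulons)
for name, (n_genes, suspicious) in report.items():
    print(name, n_genes, "suspicious" if suspicious else "ok")
```

With real output from df2regulons, the mapping for each regulon would presumably come from the gene2weight field shown in the output above. A healthy regulon should list full gene names as keys, not brackets, quotes and digits.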

Tripfantasy • Oct 18 '23 17:10

The pandas version might need to be restricted to the 1.5 release: pip install 'pandas<2.0'. Since pandas 2.0, pandas can use pyarrow as the DataFrame backend instead of NumPy, and at least in the earlier 2.x versions not everything worked properly with it (this is not restricted to pySCENIC).
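
A quick way to see whether an environment falls in the affected range is to read the installed pandas version with the standard library. This is a minimal sketch; the helper names are hypothetical, and the "2.0 or later" threshold simply mirrors the pin suggested above instead of a confirmed list of affected releases.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '2.1.0'."""
    return int(version_string.split(".")[0])


def pandas_needs_pin(version_string):
    """True if the version is 2.0 or later, the range the suggested pin avoids."""
    return major_version(version_string) >= 2


try:
    installed = version("pandas")
except PackageNotFoundError:
    print("pandas is not installed in this environment")
else:
    if pandas_needs_pin(installed):
        print(f"pandas {installed}: consider pip install 'pandas<2.0'")
    else:
        print(f"pandas {installed}: already below 2.0")
```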

ghuls • Dec 12 '23 10:12

@ghuls I tried the solution suggested by @Tripfantasy, and both the df2regulons output and auc_mtx looked correct when using pandas 2.1.0.

wangjiawen2013 • Jan 11 '24 05:01

The pandas version might need to be restricted to the 1.5 release: pip install 'pandas<2.0'. Since pandas 2.0, pandas can use pyarrow as the DataFrame backend instead of NumPy, and at least in the earlier 2.x versions not everything worked properly with it (this is not restricted to pySCENIC).

It works for me! thanks!

lcd522 • May 28 '24 06:05