[BUG] Unexpected change of data type in dataframe
Describe the bug
The dtype of some columns in the dataframe returned by prune2df may be wrong.
Steps to reproduce the behavior
- Command run when the error occurred:
df = prune2df(dbs,modules,'./motif_annotation.polish.tbl',auc_threshold=0.02,num_workers=100)
- This command produced no error message. However, the dtype of the TargetGenes column became string[pyarrow], as shown below:
Enrichment AUC float64
NES float64
MotifSimilarityQvalue float64
OrthologousIdentity float64
Annotation string[pyarrow]
Context string[pyarrow]
TargetGenes string[pyarrow]
RankAtMax int64
dtype: object
This led to incorrect results from df2regulons. The output of 'df2regulons(df)' became:
[Regulon(name='Ger_000039(+)', gene2weight=frozendict.frozendict({'[': 1.0, '(': 1.0, "'": 1.0, 'G': 1.0, 'e': 1.0, 'r': 1.0, '_': 1.0, '0': 1.0, '3': 1.0, '9': 1.0, '8': 1.0, ',': 1.0, ' ': 1.0, '1': 1.0, '.': 1.0, '6': 1.0, '7': 1.0, '5': 1.0, '2': 1.0, '4': 1.0, ')': 1.0, ']': 1.0}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Ger_000039', context=frozenset({'activating'}), score=207.67981294849346, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
Regulon(name='Ger_002027(+)', gene2weight=frozendict.frozendict({'[': 1.0, '(': 1.0, "'": 1.0, 'G': 1.0, 'e': 1.0, 'r': 1.0, '_': 1.0, '0': 1.0, '1': 1.0, '5': 1.0, '4': 1.0, ',': 1.0, ' ': 1.0, '.': 1.0, '2': 1.0, '3': 1.0, '7': 1.0, '8': 1.0, '9': 1.0, ')': 1.0, ']': 1.0}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Ger_002027', context=frozenset({'activating'}), score=85.88838629001951, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
Regulon(name='Ger_002712(+)', gene2weight=frozendict.frozendict({'[': 1.0, '(': 1.0, "'": 1.0, 'G': 1.0, 'e': 1.0, 'r': 1.0, '_': 1.0, '0': 1.0, '3': 1.0, '5': 1.0, '9': 1.0, '8': 1.0, '7': 1.0, ',': 1.0, ' ': 1.0, '.': 1.0, '2': 1.0, '4': 1.0, '1': 1.0, ')': 1.0, '6': 1.0, ']': 1.0}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Ger_002712', context=frozenset({'activating'}), score=129.436625254587, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
Regulon(name='Ger_003246(+)', gene2weight=frozendict.frozendict({'[': 1.0, '(': 1.0, "'": 1.0, 'G': 1.0, 'e': 1.0, 'r': 1.0, '_': 1.0, '0': 1.0, '4': 1.0, '5': 1.0, '8': 1.0, '2': 1.0, ',': 1.0, ' ': 1.0, '1': 1.0, '.': 1.0, '9': 1.0, '3': 1.0, '6': 1.0, ')': 1.0, '7': 1.0, ']': 1.0}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Ger_003246', context=frozenset({'activating'}), score=188.21663463639973, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation='')]
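The single-character keys in gene2weight above are what one would expect if each TargetGenes entry were text rather than a list of (gene, weight) tuples, because iterating over a Python string yields its characters. A minimal sketch of that effect in plain Python, using a made-up entry (this only illustrates the symptom and is not pySCENIC's actual code path):

```python
import ast

# Hypothetical TargetGenes entry, stored as text instead of a list of tuples.
as_text = "[('Ger_000398', 1.67), ('Ger_000125', 2.4)]"

# Iterating over a string yields single characters, not (gene, weight) pairs.
chars = sorted(set(as_text))

# Parsing the text restores the intended list of (gene, weight) tuples.
as_list = ast.literal_eval(as_text)
genes = [gene for gene, _weight in as_list]

print(chars[:5])
print(genes)
```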
Expected behavior
The dtype of the dataframe's 'TargetGenes' column should be 'object', which would avoid this kind of mistake. However, I cannot figure out what in the prune2df function caused this problem.
Installation information:
- pySCENIC version:0.12.1+6.g31d51a1
- Installation method: Pip
- Run environment: VScode
- OS: Ubuntu
I solved the problem with the following code:
import ast
# Parse each stringified TargetGenes entry back into a list of (gene, weight) tuples.
target_col = ('Enrichment', 'TargetGenes')
df[target_col] = df[target_col].astype(object).map(ast.literal_eval)
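A guarded variant of this workaround might be useful when it is unclear whether the column has already been repaired. The sketch below runs on a toy dataframe with made-up gene names; the helper name parse_target_genes is hypothetical, and the two-level column layout is assumed from the ('Enrichment', 'TargetGenes') key used above.

```python
import ast
import pandas as pd

def parse_target_genes(value):
    """Parse text entries with ast.literal_eval; leave anything else untouched."""
    return ast.literal_eval(value) if isinstance(value, str) else value

# Toy stand-in for the prune2df output: tuple keys give two-level columns.
# One entry is stored as text, the other is already a proper list.
toy = pd.DataFrame({
    ("Enrichment", "NES"): [3.2, 4.1],
    ("Enrichment", "TargetGenes"): [
        "[('GeneA', 1.5), ('GeneB', 0.7)]",
        [("GeneC", 2.0)],
    ],
})

target_col = ("Enrichment", "TargetGenes")
# astype(object) ensures the values are handled as plain Python objects.
toy[target_col] = toy[target_col].astype(object).map(parse_target_genes)

print(toy[target_col].dtype)
print(toy[target_col].tolist())
```

Because entries that are not strings pass through unchanged, running the conversion a second time should leave the column as it is.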
Hello,
I'm trying to figure out whether I have the same problem. For some reason my AUCell matrix is all zeros, with the majority of genes not mapping properly. My df2regulons output looks like this:
[Regulon(name='Alx1(+)', gene2weight=frozendict.frozendict({'[': 1.0, '(': 1.0, "'": 1.0, 'T': 1.0, 'i': 1.0, 'm': 1.0, 'p': 1.0, '2': 1.0, ',': 1.0, ' ': 1.0, '.': 1.0, '3': 1.0, '7': 1.0, '1': 1.0, '0': 1.0, '6': 1.0, '5': 1.0, '8': 1.0, ')': 1.0, 'D': 1.0, 'a': 1.0, '4': 1.0, '9': 1.0, 'C': 1.0, 'o': 1.0, 'z': 1.0, 'S': 1.0, 'e': 1.0, 'r': 1.0, 'n': 1.0, 'h': 1.0, 'R': 1.0, 'c': 1.0, ']': 1.0, 'N': 1.0, 't': 1.0, 'g': 1.0, 'E': 1.0, 'l': 1.0, 'd': 1.0, 'F': 1.0, 'b': 1.0, 'A': 1.0, 'x': 1.0, 's': 1.0, 'B': 1.0, 'L': 1.0, 'I': 1.0, 'f': 1.0, 'P': 1.0, 'k': 1.0, 'M': 1.0, 'G': 1.0, 'V': 1.0, 'q': 1.0}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Alx1', context=frozenset({'metacluster_9.9.png', 'activating'}), score=4.036787272144878, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
Regulon(name='Alx3(+)', gene2weight=frozendict.frozendict({'[': 1.0, '(': 1.0, "'": 1.0, 'P': 1.0, 'l': 1.0, 's': 1.0, 'c': 1.0, 'r': 1.0, '4': 1.0, ',': 1.0, ' ': 1.0, '0': 1.0, '.': 1.0, '7': 1.0, '9': 1.0, '6': 1.0, '8': 1.0, '5': 1.0, '1': 1.0, '2': 1.0, '3': 1.0, ')': 1.0, 'C': 1.0, 'p': 1.0, 'a': 1.0, 'I': 1.0, 'n': 1.0, 'M': 1.0, 'e': 1.0, 't': 1.0, 'F': 1.0, 'o': 1.0, 'x': 1.0, 'd': 1.0, 'R': 1.0, 'D': 1.0, 'i': 1.0, 'L': 1.0, 'T': 1.0, 'b': 1.0, 'E': 1.0, 'h': 1.0, 'N': 1.0, 'g': 1.0, 'f': 1.0, 'm': 1.0, 'K': 1.0, ']': 1.0, 'v': 1.0, 'S': 1.0, 'A': 1.0, 'k': 1.0, 'O': 1.0, 'G': 1.0, 'j': 1.0, 'U': 1.0, 'J': 1.0, 'W': 1.0, 'V': 1.0, 'w': 1.0, 'B': 1.0, 'H': 1.0, 'u': 1.0, 'z': 1.0}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Alx3', context=frozenset({'metacluster_9.26.png', 'activating'}), score=3.343534473214638, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
Is there a reference for what the expected output looks like? How are you checking how many genes each regulon has in your data? Thanks in advance.
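One possible way to count genes per regulon and to spot the symptom shown above is to look at the gene2weight mapping of each regulon. The sketch below uses a hypothetical ToyRegulon stand-in with made-up contents so that it runs without pySCENIC; it assumes only the name and gene2weight attributes visible in the printed output.

```python
from dataclasses import dataclass, field

@dataclass
class ToyRegulon:
    """Hypothetical stand-in exposing only the two attributes used below."""
    name: str
    gene2weight: dict = field(default_factory=dict)

# Made-up examples: one corrupted regulon (character keys), one healthy one.
regulons = [
    ToyRegulon("Alx1(+)", {"[": 1.0, "(": 1.0, "T": 1.0}),
    ToyRegulon("TfX(+)", {"GeneA": 3.7, "GeneB": 1.2}),
]

def genes_per_regulon(regs):
    """Map each regulon name to the number of keys in its gene2weight."""
    return {r.name: len(r.gene2weight) for r in regs}

def looks_corrupted(reg):
    """True when every key in gene2weight is a single character."""
    keys = list(reg.gene2weight)
    return bool(keys) and all(len(k) == 1 for k in keys)

print(genes_per_regulon(regulons))
print([r.name for r in regulons if looks_corrupted(r)])
```

A regulon whose keys are all single characters such as '[' or ',' was most likely built from a stringified gene list rather than from real gene names.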
The pandas version might need to be restricted to the 1.5 release: pip install 'pandas<2.0'. Since pandas 2.0, pyarrow can be used as the dataframe backend instead of NumPy, and at least in the earlier pandas 2.x versions not everything worked properly with it (this is not specific to pySCENIC).
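Before downgrading, it may help to confirm which pandas version is installed and whether a dataframe actually contains pyarrow-backed string columns. A minimal sketch, where pyarrow_string_columns is a hypothetical helper and the example frame is made up:

```python
import pandas as pd

def pyarrow_string_columns(frame):
    """Return the labels of columns whose dtype is a pyarrow-backed string."""
    return [
        label
        for label, dtype in frame.dtypes.items()
        if isinstance(dtype, pd.StringDtype) and dtype.storage == "pyarrow"
    ]

# Made-up frame whose TargetGenes column holds real lists (object dtype).
example = pd.DataFrame({
    "NES": [3.2, 4.1],
    "TargetGenes": [[("GeneA", 1.5)], [("GeneB", 0.7)]],
})

print(pd.__version__)
print(pyarrow_string_columns(example))
```

An empty list means no column is stored as a pyarrow-backed string. A non-empty list that includes TargetGenes would match the dtypes reported in this issue.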
I tried the solution suggested by @Tripfantasy; both the df2regulons output and auc_mtx looked correct when using pandas 2.1.0. @ghuls
The pandas version might need to be restricted to the 1.5 release:
pip install 'pandas<2.0'. Since pandas 2.0, pyarrow can be used as the dataframe backend instead of NumPy, and at least in the earlier pandas 2.x versions not everything worked properly with it (this is not specific to pySCENIC).
It works for me! thanks!