scenicplus icon indicating copy to clipboard operation
scenicplus copied to clipboard

export_pseudobulk no bed files

Open lililab-uwyo opened this issue 2 years ago • 7 comments

Hello, Seppe and Scenicplus team.

I encountered issues when I performed export_pseudobulk step. I have several multiome libraries. It was working back in July and I just re-run everything with some modification with the defined clusters.

Now, I noticed export_pseudobulk can't generate bed files and I also noticed that #262 and #275 also reported similar issues.

Here is my trials:

  1. Initially, it indicated I need to install pyrle. I installed pyrle, then ran export_pseudobulk with n_cpu = 72 (it works in July). However, there is no bed files generated.
from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,
                 variable = 'Celltype',                                                                     # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype
                 sample_id_col = 'sample_id',
                 chromsizes = chromsizes,
                 bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),  # specify where pseudobulk_bed_files should be stored
                 bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored
                 path_to_fragments = fragments_dict,                                                        # location of fragment fiels
                 n_cpu = 72,                                                                                 # specify the number of cores to use, we use ray for multi processing
                 normalize_bigwig = True,
                 remove_duplicates = True,
                 _temp_dir = os.path.join(tmpDir, 'ray_spill'),
                 split_pattern = '___',
                 use_polars = True)

I printed bw_paths and bed_paths, there are corrected.

print(bed_paths)
{'Agrp': '/work/InternalMedicine/***/project_otp/Pomc_CellHashing/scenicplus/scATAC/consensus_peak_calling/pseudobulk_bed_files/Agrp.bed.gz', 'Dlx6': '/work/InternalMedicine/***/project_otp/Pomc_CellHashing/scenicplus/scATAC/consensus_peak_calling/pseudobulk_bed_files/Dlx6.bed.gz', 'Fezf1': '/work/InternalMedicine/***/project_otp/Pomc_CellHashing/scenicplus/scATAC/consensus_peak_calling/pseudobulk_bed_files/Fezf1.bed.gz', 'Lhx1': '/work/InternalMedicine/***/project_otp/Pomc_CellHashing/scenicplus/scATAC/consensus_peak_calling/pseudobulk_bed_files/Lhx1.bed.gz',
....}

Error output


(export_pseudobulk_ray pid=2386) 2023-12-27 15:12:22,349 cisTopic     INFO     Creating pseudobulk for Slc24a4

(export_pseudobulk_ray pid=2386) /home2/s186179/.conda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py:274: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
(export_pseudobulk_ray pid=2386)   group_fragments = group_fragments_list[0].append(group_fragments_list[1:])
2023-12-27 15:14:58,951	ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::export_pseudobulk_ray() (pid=2386, ip=172.18.224.56)
  File "/home2/s186179/.conda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py", line 346, in export_pseudobulk_ray
    export_pseudobulk_one_sample(
  File "/home2/s186179/.conda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py", line 285, in export_pseudobulk_one_sample
    group_pr.to_bigwig(
  File "/home2/s186179/.conda/envs/scenicplus/lib/python3.8/site-packages/pyranges/pyranges_main.py", line 5506, in to_bigwig
    result = _to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun)
  File "/home2/s186179/.conda/envs/scenicplus/lib/python3.8/site-packages/pyranges/out.py", line 212, in _to_bigwig
    unique_chromosomes = gr.chromosomes
  File "/home2/s186179/.conda/envs/scenicplus/lib/python3.8/site-packages/pandas/core/generic.py", line 5907, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'chromosomes'

  1. I tried to lower n_cpu = 1. The error is "AttributeError: 'DataFrame' object has no attribute 'chromosomes'"
from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,
                 variable = 'Celltype',                                                                     # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype
                 sample_id_col = 'sample_id',
                 chromsizes = chromsizes,
                 bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),  # specify where pseudobulk_bed_files should be stored
                 bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored
                 path_to_fragments = fragments_dict,                                                        # location of fragment fiels
                 n_cpu = 1,                                                                                 # specify the number of cores to use, we use ray for multi processing
                 normalize_bigwig = True,
                 remove_duplicates = True,
                 _temp_dir = os.path.join(tmpDir, 'ray_spill'),
                 split_pattern = '___',
                 use_polars = True)

Error output

2023-12-27 12:48:24,070 cisTopic     INFO     Reading fragments from /work/InternalMedicine/s186179/project_otp/Pomc_CellHashing/Cellranger/PomcCre_Sun1GFP_2_Multiome_GEXplusATAC_20220406/outs/atac_fragments.tsv.gz
2023-12-27 12:49:08,646 cisTopic     INFO     Reading fragments from /work/InternalMedicine/s186179/project_otp/Pomc_CellHashing/Cellranger/PomcCre_Sun1GFP_3_Multiome_GEXplusATAC_20220902/outs/atac_fragments.tsv.gz
2023-12-27 12:54:47,229 cisTopic     INFO     Reading fragments from /work/InternalMedicine/s186179/project_otp/Pomc_CellHashing/Cellranger/PomcCre_Sun1GFP_4_Multiome_GEXplusATAC_20220902/outs/atac_fragments.tsv.gz
2023-12-27 12:57:49,152 cisTopic     INFO     Reading fragments from /work/InternalMedicine/s186179/project_otp/Pomc_CellHashing/Cellranger/PomcCre_Sun1GFP_5_Multiome_GEXplusATAC_20221220/outs/atac_fragments.tsv.gz
2023-12-27 13:01:18,559 cisTopic     INFO     Creating pseudobulk for Agrp

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_18655/3796073447.py in ?()
----> 4 import pyrle
      5 
      6 from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
      7 bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py in ?(input_data, variable, chromsizes, bed_path, bigwig_path, path_to_fragments, sample_id_col, n_cpu, normalize_bigwig, remove_duplicates, split_pattern, use_polars, **kwargs)
    182             num_returns=len(groups),
    183         )
    184         ray.shutdown()
    185     else:
--> 186         [
    187             export_pseudobulk_one_sample(
    188                 cell_data,
    189                 group,

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py in ?(.0)
    186 def export_pseudobulk(
--> 187     input_data: Union["CistopicObject", pd.DataFrame, Dict[str, pd.DataFrame]],
    188     variable: str,
    189     chromsizes: Union[pd.DataFrame, pr.PyRanges],
    190     bed_path: str,

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py in ?(cell_data, group, fragments_df_dict, chromsizes, bigwig_path, bed_path, sample_id_col, normalize_bigwig, remove_duplicates, split_pattern)
    281     group_pr = pr.PyRanges(group_fragments)
    282     if isinstance(bigwig_path, str):
    283         bigwig_path_group = os.path.join(bigwig_path, str(group) + ".bw")
    284         if remove_duplicates:
--> 285             group_pr.to_bigwig(
    286                 path=bigwig_path_group,
    287                 chromosome_sizes=chromsizes,
    288                 rpm=normalize_bigwig,

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pyranges/pyranges_main.py in ?(self, path, chromosome_sizes, rpm, divide, value_col, dryrun, chain)
   5502 
   5503         if chromosome_sizes is None:
   5504             chromosome_sizes = pr.data.chromsizes()
   5505 
-> 5506         result = _to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun)
   5507 
   5508         if dryrun:
   5509             return result

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pyranges/out.py in ?(self, path, chromosome_sizes, rpm, divide, value_col, dryrun)
    208             new_pyrles[k] = v
    209 
    210         gr = c.defragment().to_ranges()
    211 
--> 212     unique_chromosomes = gr.chromosomes
    213 
    214     subset = ["Chromosome", "Start", "End", "Score"]
    215 

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pandas/core/generic.py in ?(self, name)
   5903             and name not in self._accessors
   5904             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5905         ):
   5906             return self[name]
-> 5907         return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'chromosomes'


  1. I also tried export_pseudobulk_one_sample function as you mentioned in #262 . The error is "AttributeError: 'DataFrame' object has no attribute 'chromosomes'".

Here is the code.

print(set(cell_data["sample_id"]))
print(set(cell_data["Celltype"]))
#print(fragment_dict.keys())
print(fragments_dict)
from pycisTopic.utils import (
	read_fragments_from_file,
	prepare_tag_cells
)
import pyranges as pr
import re
import numpy as np

variable = 'Celltype'
sample_id_col = 'sample_id'
chromsizes = chromsizes
split_pattern = '___'

fragments_df_dict = {}
for sample_id in fragments_dict.keys():
	fragments_df = read_fragments_from_file(
		fragments_dict[sample_id],
		use_polars=True
	).df
	fragments_df.Start = np.int32(fragments_df.Start)
	fragments_df.End = np.int32(fragments_df.End)
	if "Score" in fragments_df:
		fragments_df.Score = np.int32(fragments_df.Score)
	if "barcode" in cell_data:
		fragments_df = fragments_df.loc[
			fragments_df["Name"].isin(cell_data["barcode"].tolist())
		]
	else:
		fragments_df = fragments_df.loc[
			fragments_df["Name"].isin(
				prepare_tag_cells(cell_data.index.tolist(), split_pattern)
			)
		]
	fragments_df_dict[sample_id] = fragments_df
if "barcode" in cell_data:
	cell_data = cell_data.loc[:, [variable, sample_id_col, "barcode"]]
else:
	cell_data = cell_data.loc[:, [variable, sample_id_col]]
cell_data[variable] = cell_data[variable].replace(
	" ", "", regex=True)
cell_data[variable] = cell_data[variable].replace(
	"[^A-Za-z0-9]+", "_", regex=True)
groups = sorted(list(set(cell_data[variable])))
if isinstance(chromsizes, pd.DataFrame):
	chromsizes = chromsizes.loc[:, ["Chromosome", "Start", "End"]]
	chromsizes = pr.PyRanges(chromsizes)

from pycisTopic.pseudobulk_peak_calling import *
export_pseudobulk_one_sample(
	cell_data = cell_data,
	group = "Agrp",
	fragments_df_dict = fragments_df_dict,
	chromsizes = chromsizes,
	bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),
	bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),
	sample_id_col = sample_id_col,
	normalize_bigwig = True,
	remove_duplicates = True,
	split_pattern = split_pattern
)

Error output

2023-12-27 12:09:59,794 cisTopic     INFO     Creating pseudobulk for Agrp

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_18655/1944494090.py in ?()
----> 2 from pycisTopic.pseudobulk_peak_calling import *
      3 export_pseudobulk_one_sample(
      4         cell_data = cell_data,
      5         group = "Agrp",

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py in ?(cell_data, group, fragments_df_dict, chromsizes, bigwig_path, bed_path, sample_id_col, normalize_bigwig, remove_duplicates, split_pattern)
    281     group_pr = pr.PyRanges(group_fragments)
    282     if isinstance(bigwig_path, str):
    283         bigwig_path_group = os.path.join(bigwig_path, str(group) + ".bw")
    284         if remove_duplicates:
--> 285             group_pr.to_bigwig(
    286                 path=bigwig_path_group,
    287                 chromosome_sizes=chromsizes,
    288                 rpm=normalize_bigwig,

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pyranges/pyranges_main.py in ?(self, path, chromosome_sizes, rpm, divide, value_col, dryrun, chain)
   5502 
   5503         if chromosome_sizes is None:
   5504             chromosome_sizes = pr.data.chromsizes()
   5505 
-> 5506         result = _to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun)
   5507 
   5508         if dryrun:
   5509             return result

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pyranges/out.py in ?(self, path, chromosome_sizes, rpm, divide, value_col, dryrun)
    208             new_pyrles[k] = v
    209 
    210         gr = c.defragment().to_ranges()
    211 
--> 212     unique_chromosomes = gr.chromosomes
    213 
    214     subset = ["Chromosome", "Start", "End", "Score"]
    215 

~/.conda/envs/scenicplus/lib/python3.8/site-packages/pandas/core/generic.py in ?(self, name)
   5903             and name not in self._accessors
   5904             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5905         ):
   5906             return self[name]
-> 5907         return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'chromosomes'

fragments_df_dict

{'POMC.2.Multiome':           Chromosome     Start       End                Name  Score
 0         GL456210.1       674       721  TGGTAAACACAAAGAC-1      2
 1         GL456210.1       762       840  CATGCAAGTGCTGTAA-1      2
 3         GL456210.1      1237      1314  GCTATAGGTGACATGC-1      7
 7         GL456210.1      3973      4269  TCCTCAATCGTTACTT-1      1
 10        GL456210.1      5799      6046  CTAGCTTGTGCATTAG-1      1
 ...              ...       ...       ...                 ...    ...
 46402681        chrY  90813548  90813613  TCTAGCGAGGCTACTG-1      3
 46402682        chrY  90813552  90813627  CCTAAATCAGGCCAAA-1      5
 46402687        chrY  90813565  90813613  CCTAAATCAGGCCAAA-1      1
 46402689        chrY  90813598  90813759  CTAATTGAGTAGCTTA-1      4
 46402699        chrY  90813819  90813894  TCCATGCTCCTAAGAC-1      3
 
 [10729616 rows x 5 columns],
 'POMC.3.Multiome':            Chromosome     Start       End                Name  Score
 0          GL456210.1       777       826  CTTGTTCCATGGTTAT-1      1
 15         GL456210.1      4239      4389  GTCTAACAGTAAGAAC-1      1
 17         GL456210.1      4319      4696  ATTGCGCCAGCACGTT-1      1
 19         GL456210.1      4483      4675  AATAGAGGTTAGCGTA-1      4
 35         GL456210.1      5295      5340  TTGTTGCGTTTGCTGT-1      1
 ...               ...       ...       ...                 ...    ...
 389415055        chrY  90829719  90829877  ACTCGCTTCTTGCAGG-1      4
 389415058        chrY  90829766  90829877  GCTGACCAGGATCACT-1      3
 389415059        chrY  90829808  90829871  GGTGCTTCAGAGAGCC-1      2
 389415060        chrY  90829808  90829882  GAACCTGTCGATTCTT-1      6
 389415064        chrY  90832009  90832046  AGTAACGAGGGTGAAC-1      1
 
 [148888034 rows x 5 columns],
 'POMC.4.Multiome':            Chromosome     Start       End                Name  Score
 5          GL456210.1       788       840  GAACGAATCAATCTCT-1      2
 6          GL456210.1       796       840  AGGTCAAAGCTGGCTA-1      1
 8          GL456210.1      1229      1530  CATTGTGCAGCAATAA-1      1
 11         GL456210.1      3546      3894  TTTCTTGCATCCCTCA-1      1
 17         GL456210.1      4573      4702  GTGCAAGCATTGTGCA-1      2
 ...               ...       ...       ...                 ...    ...
 207573434        chrY  90829569  90829755  TAAGCCAGTGGATGTC-1      2
 207573435        chrY  90829572  90829741  ATATAGGCAAGCTTTG-1      1
 207573441        chrY  90829611  90829762  GATTACTCAACAACAA-1      1
 207573442        chrY  90829613  90829762  GATTACTCAACAACAA-1      1
 207573451        chrY  90836714  90837044  CATGCGCAGTAGGATG-1      1
 
 [73288379 rows x 5 columns],
 'POMC.5.Multiome':            Chromosome     Start       End                Name  Score
 5          GL456210.1       634      1461  GTCCTCAGTGGATGTC-1      2
 12         GL456210.1       671       756  AATTTCCTCCAGGAAA-1      2
 17         GL456210.1       713       929  AACAAAGGTTGTAACG-1      3
 21         GL456210.1       766       926  GACCTGCAGTAGCGGG-1      3
 29         GL456210.1       788       830  GCTGTACCAACACCTA-1      3
 ...               ...       ...       ...                 ...    ...
 224486712        chrY  90836633  90836956  CATCACACAGTTTCTC-1      1
 224486714        chrY  90836640  90836776  GGTATTTCAACTCGCG-1      3
 224486715        chrY  90836649  90836764  GAGGTACAGGGATGAC-1      3
 224486716        chrY  90836651  90836764  GAGGTACAGGGATGAC-1      1
 224486719        chrY  90836676  90836828  TTTGAGTCACCAACCG-1      1
 
 [87360539 rows x 5 columns]}

It looks like pd.DataFrame does not match to Dict[str, pd.DataFrame] or pr.PyRanges. But how? Can you help me to sovle the issue? Many thanks!

-Li

Version (please complete the following information):

  • Python: [Python 3.10]
  • SCENIC+: [1.0.1.dev6+ge5ba6fc]
  • pycisTopic : [1.0.3.dev21+ge9b0e1a]

lililab-uwyo avatar Dec 27 '23 21:12 lililab-uwyo

I had this same error. I was able to get around this by changing the pyranges/out.py file that you have here: /home2/s186179/.conda/envs/scenicplus/lib/python3.8/site-packages/pyranges/out.py.

In the _to_bigwig() function I changed:

if not divide:
    gr = self.to_rle(rpm=rpm, strand=False, value_col=value_col).to_ranges()

to

if not divide:
    gr = self

ethanfenton avatar Jan 03 '24 03:01 ethanfenton

Hi @LILI-0000-0002-8173-7367 and @ethanfenton

Thank you for the bug report.

We are currently in the process of rewriting this function, see this branch https://github.com/aertslab/pycisTopic/tree/improve_create_pseudobulk.

The code in this branch should already be functional (and fully compatible with downstream functions), although not optimal yet.

After optimising and testing the code some more we will merge it in the main branch, but feel free to already use it.

All the best,

Seppe

SeppeDeWinter avatar Jan 03 '24 08:01 SeppeDeWinter

Hi Seppe,Thanks for all your work. After replacing the files in the original .lib/pycisTopic folder with those from https://github.com/aertslab/pycisTopic/tree/improve_create_pseudobulk, when I run the export_pseudobulk function, it prompts me that the fragments.tsv.gz.tbi file is no found. However, this file is not present in my data. As a beginner, I am not sure how to obtain the .tbi file. Thanks!

LMBei3353 avatar Jan 08 '24 02:01 LMBei3353

Hi LMBei3353,

It should be in the same folder as your atac_fragment.tsv file. The paths may look something like this: ...Count/Sample1/outs/atac_fragments.tsv.gz ...Count/Sample1/outs/atac_fragments.tsv.gz.tbi

Hope you find it! Abigail

Abigail575 avatar Jan 08 '24 17:01 Abigail575

Hello, @SeppeDeWinter.

I can run export_pseudobulk with the improved code. Thanks again for your help.

Li

lililab-uwyo avatar Jan 08 '24 17:01 lililab-uwyo

@ethanfenton

Thanks for your input.

lililab-uwyo avatar Jan 08 '24 17:01 lililab-uwyo

@Abigail575 Thanks for your help.I successfully compressed the 'fragments.tsv.gz' file into a '.tbi' file using the tabix tool, and it ran without any issues.

LMBei3353 avatar Jan 10 '24 01:01 LMBei3353