bambu icon indicating copy to clipboard operation
bambu copied to clipboard

Not working for UCSC fasta and annotation

Open alexyfyf opened this issue 2 years ago • 1 comments

Hi team,

thanks for developing this wonderful tool. I just want to confirm if it is designed not to support UCSC genome and annotation files? I found if I use hg38.fa and TxDb.Hsapiens.UCSC.hg38.knownGene, the bambu function gave me an error. But if I supplied it with Homo_sapiens.GRCh38.dna.primary_assembly.fa and Homo_sapiens.GRCh38.109.gtf, it run without error. In both case, the BAM files were generated using the corresponding fasta with the same parameter of minimap2, eg

minimap2 -ax splice $fasta SGNex_MCF7_cDNA_replicate1_run3/SGNex_MCF7_cDNA_replicate1_run3.fastq.gz -t 24 -I 1000G | samtools view -@ 24 -bSh - -o mcf7_rep1_run3_align.bam
samtools sort -@ 24 mcf7_rep1_run3_align.bam -o mcf7_rep1_run3_sorted.bam
samtools index -@ 24 mcf7_rep1_run3_sorted.bam

So if I run with hg38

test.bam <- 'mcf7_rep1_run3_sorted.bam'
fa.file <- '/home/users/allstaff/yan.a/yan.a/ref/hg38/hg38.fa'
system.time(bambuAnnotations <- keepStandardChromosomes(TxDb.Hsapiens.UCSC.hg38.knownGene) %>% prepareAnnotations)
system.time(se <- bambu(reads = test.bam, annotations = bambuAnnotations, genome = fa.file,
            ncore = 1, quant = F, verbose = T, NDR=1,
            rcOutDir = 'rcout'))

I got the error

Error: BiocParallel errors
  1 remote errors, element index: 1
  0 unevaluated and other errors
  first remote error:
Error in relist(unlistData, partitioning): shape of 'skeleton' is not compatible with 'NROW(flesh)'

But if I run with ENSEMBL fasta

test.bam <- 'mcf7_rep1_run3_align_GRCh38.bam'
fa.file <- '/home/users/allstaff/yan.a/yan.a/ref/GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa'
gtf.file <- '/home/users/allstaff/yan.a/yan.a/ref/GRCh38/Homo_sapiens.GRCh38.109.gtf'
bambuAnnotations <- prepareAnnotations(gtf.file)
system.time(se <- bambu(reads = test.bam, annotations = bambuAnnotations, genome = fa.file,
                        ncore = 1, quant = F, verbose = T, NDR=1,
                        rcOutDir = 'rcout'))

I finish with only some warning. Could you double check it? I hope it would support UCSC because then we can seamlessly integrate with the genomedb and txdb in bioconductor.

Thanks a lot. Cheers, Alex

alexyfyf avatar Mar 06 '23 01:03 alexyfyf

Hi Alex,

Thanks for using Bambu. I did some digging and I realized that after using prepareAnnotations on TxDb.Hsapiens.UCSC.hg38.knownGene, 16% of the transcripts that do not have a gene id assigned. Bambu needs all transcripts to have a gene id to be able to function. I do not yet know if this is because of the annotations themselves or due to the bambu prepareAnnotations function and I will investigate this further.

However as a temporary measure so that you can use UCSC you can remove these annotations bambuAnnotations = bambuAnnotations[which(!is.na(mcols(bambuAnnotations)$GENEID))] Alternatively you could assign your own arbitrary gene ids to these to keep them in the annotations mcols(bambuAnnotations[which(is.na(mcols(bambuAnnotations)$GENEID))])$GENEID = seq_len(sum(is.na(mcols(bambuAnnotations)$GENEID)))

Hope this helps for now,

Kind Regards, Andre Sim

andredsim avatar Mar 06 '23 03:03 andredsim